I am trying to create a Sankey diagram in R using the googleVis package (the data frame I am using can be found below). I want the diagram to flow from the Type to the Organization and then to the team (Tm), with the link size representing the number of players (Name). From what I have read, one can only use three columns, so I built the data with this code:
BrewersDraft <- sqldf("SELECT Type, Organization, COUNT(Name) AS PLAYERS
FROM df
GROUP BY 1,2
UNION ALL
SELECT Type, (Tm) AS MLB_TEAM, COUNT(Name) AS PLAYERS
FROM df
GROUP BY 1,2")
The data now looks like this:
Type Organization
1 College/University Bradley University (Peoria, IL)
2 College/University California State University Fullerton (Fullerton, CA)
3 College/University Clemson University (Clemson, SC)
4 College/University East Tennessee State University (Johnson City, TN)
5 College/University Faulkner University (Montgomery, AL)
6 College/University Felician College (Lodi, NJ)
PLAYERS
1 1
2 1
3 1
4 1
5 1
6 1
The "Brewers" value is also in the Organization value. Then I used this code to create the Sankey Diagram:
plot(gvisSankey(BrewersDraft, from = "Type", to="Organization_Type", weight = "PLAYERS",
options = list(height=800, width=850,
sankey="{
link:{color:{fill: 'lightblue'}}}")))
The problem is that the Brewers value in the Sankey diagram is with all of the Organization variables when I want the Organization variables to flow to the Brewers variable.
It should look similar to the example on this website, https://thedatagame.com.au/2015/12/14/visualising-the-2015-nba-draft-in-r/
Only difference being that all of the Organization is only going to one team, instead of many.
Can anybody help me? Thank you, it would be much appreciated.
The original data frame.
Year Rnd OvPck RdPck Tm Name Pos
1 2016 1 5 5 Brewers Corey Ray (minors) OF
2 2016 2 46 5 Brewers Lucas Erceg (minors) 3B
3 2016 2 75 34 Brewers Mario Feliciano (minors) C
4 2016 3 82 5 Brewers Braden Webb (minors) RHP
5 2016 4 111 5 Brewers Corbin Burnes (minors) RHP
6 2016 5 141 5 Brewers Zack Brown (minors) RHP
7 2016 6 171 5 Brewers Payton Henry (minors) C
8 2016 7 201 5 Brewers Daniel Brown (minors) LHP
9 2016 8 231 5 Brewers Francisco Thomas (minors) SS
10 2016 9 261 5 Brewers Trey York (minors) 2B
11 2016 10 291 5 Brewers Blake Fox (minors) LHP
12 2016 11 321 5 Brewers Chad McClanahan (minors) 3B
13 2016 12 351 5 Brewers Trever Morrison (minors) SS
14 2016 13 381 5 Brewers Thomas Jankins (minors) RHP
15 2016 14 411 5 Brewers Gabriel Garcia (minors) C
16 2016 15 441 5 Brewers Scott Serigstad (minors) RHP
17 2016 16 471 5 Brewers Louie Crow (minors) RHP
18 2016 17 501 5 Brewers Weston Wilson (minors) 3B
19 2016 18 531 5 Brewers Cooper Hummel (minors) C
20 2016 19 561 5 Brewers Zach Clark (minors) CF
21 2016 20 591 5 Brewers Jared Horn (minors) RHP
22 2016 21 621 5 Brewers Nathan Rodriguez (minors) C
23 2016 22 651 5 Brewers Cam Roegner (minors) LHP
24 2016 23 681 5 Brewers Ronnie Gideon (minors) 1B
25 2016 24 711 5 Brewers Michael Gonzalez (minors) RHP
26 2016 25 741 5 Brewers Blake Lillis (minors) LHP
27 2016 26 771 5 Brewers Nick Roscetti (minors) SS
28 2016 27 801 5 Brewers Nick Cain (minors) RF
29 2016 28 831 5 Brewers Andrew Vernon (minors) RHP
30 2016 29 861 5 Brewers Brennan Price (minors) RHP
31 2016 30 891 5 Brewers Dalton Brown (minors) RHP
32 2016 31 921 5 Brewers Ryan Aguilar (minors) 1B
33 2016 32 951 5 Brewers Wilson Adams (minors) RHP
34 2016 33 981 5 Brewers Emerson Gibbs (minors) RHP
35 2016 34 1011 5 Brewers Matt Smith (minors) RHP
36 2016 35 1041 5 Brewers Chase Williams (minors) RHP
37 2016 36 1071 5 Brewers Parker Bean (minors) RHP
38 2016 37 1101 5 Brewers Jomar Cortes (minors) SS
39 2016 38 1131 5 Brewers Caleb Whalen (minors) CF
40 2016 39 1161 5 Brewers Jose Gomez (minors) CF
41 2016 40 1191 5 Brewers Kyle Serrano (minors) RHP
Type Organization
1 College/University University of Louisville (Louisville, KY)
2 College/University Menlo College (Atherton, CA)
3 High School Carlos Beltran Baseball Academy (Florida, PR)
4 College/University University of South Carolina (Columbia, SC)
5 College/University St. Mary's College of California (Moraga, CA)
6 College/University University of Kentucky (Lexington, KY)
7 High School Pleasant Grove HS (Pleasant Grove, UT)
8 College/University Mississippi State University (Mississippi State, MS)
9 High School Osceola HS (Kissimmee, FL)
10 College/University East Tennessee State University (Johnson City, TN)
11 College/University Rice University (Houston, TX)
12 High School Brophy College Preparatory (Phoenix, AZ)
13 College/University Oregon State University (Corvallis, OR)
14 College/University Quinnipiac College (Hamden, CT)
15 Junior College Broward Community College (Fort Lauderdale, FL)
16 College/University California State University Fullerton (Fullerton, CA)
17 High School Buena Park HS (Buena Park, CA)
18 College/University Clemson University (Clemson, SC)
19 College/University University of Portland (Portland, OR)
20 Junior College Pearl River Community College (Poplarville, MS)
21 High School Vintage HS (Napa, CA)
22 Junior College Cypress College (Cypress, CA)
23 College/University Bradley University (Peoria, IL)
24 College/University Texas A&M University (College Station, TX)
25 High School Norwalk HS (Norwalk, CT)
26 High School St. Thomas Aquinas HS (Overland Park, KS)
27 College/University University of Iowa (Iowa City, IA)
28 College/University Faulkner University (Montgomery, AL)
29 College/University North Carolina Central University (Durham, NC)
30 College/University Felician College (Lodi, NJ)
31 College/University Texas Tech University (Lubbock, TX)
32 College/University University of Arizona (Tucson, AZ)
33 College/University University of Alabama in Huntsville (Huntsville, AL)
34 College/University Tulane University (New Orleans, LA)
35 College/University Georgetown University (Washington, DC)
36 College/University Wichita State University (Wichita, KS)
37 College/University Liberty University (Lynchburg, VA)
38 High School Carlos Beltran Baseball Academy (Florida, PR)
39 College/University University of Portland (Portland, OR)
40 College/University St. Thomas University (Miami Gardens, FL)
41 College/University University of Tennessee (Knoxville, TN)
If I understand correctly you have 3 states: type, organization and team. Type is always the origin, team is the final destination and organization is at first a destination and then an origin.
In the second SQL statement you use "Type" again as the origin, when the origin should be "Organization".
Your SQL has to be modified to look like this:
BrewersDraft <- sqldf("SELECT Type, Organization, COUNT(Name) AS PLAYERS
FROM df
GROUP BY 1,2
UNION ALL
SELECT Organization, (Tm) AS MLB_TEAM, COUNT(Name) AS PLAYERS
FROM df
GROUP BY 1,2")
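The same two-stage link table can also be sketched in base R, which makes the from/to structure explicit. This is only an illustration: the three-row `df`, the names `stage1`, `stage2` and `links`, and the column names `from` and `to` are assumptions, not part of the original question.

```r
# Toy stand-in for the draft data (hypothetical rows, same column names)
df <- data.frame(
  Type = c("College/University", "College/University", "High School"),
  Organization = c("Rice University", "Clemson University", "Osceola HS"),
  Tm = "Brewers",
  Name = c("Player A", "Player B", "Player C"),
  stringsAsFactors = FALSE
)

# Stage 1: Type -> Organization, counting players per pair
stage1 <- aggregate(Name ~ Type + Organization, data = df, FUN = length)
# Stage 2: Organization -> team, so Organization is now the origin
stage2 <- aggregate(Name ~ Organization + Tm, data = df, FUN = length)

# Give both stages the same column names so they can be stacked
names(stage1) <- c("from", "to", "PLAYERS")
names(stage2) <- c("from", "to", "PLAYERS")
links <- rbind(stage1, stage2)
print(links)

# With googleVis installed, the stacked table can be plotted like this:
# plot(googleVis::gvisSankey(links, from = "from", to = "to", weight = "PLAYERS"))
```

Every Organization appears once as a destination and once as an origin, which is what lets the diagram draw it as the middle column.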
I am trying to merge two data frames in R, but an error message keeps coming up even though the variable types should all be correct.
Here is my code:
library(dplyr)  # provides %>% and rename() used below
team_info <- baseballr::mlb_teams(season = 2022)
team_info_mlb <- subset(team_info, sport_name == 'Major League Baseball')
tim2 <- team_info_mlb %>%
rename('home_team' = club_name)
tim3 <- subset(tim2, select = c('team_full_name', 'home_team'))
new_pf <- baseballr::fg_park(yr = 2022)
new_pf <- subset(new_pf, select = c('home_team', '1yr'))
info_pf <- merge(tim3, new_pf, by = 'home_team')
The final line is where the problems happen. Let me know if anyone has advice.
The problem is that the data have some fancy class attributes.
> class(tim3)
[1] "baseballr_data" "tbl_df" "tbl" "data.table" "data.frame"
> class(new_pf)
[1] "baseballr_data" "tbl_df" "tbl" "data.table" "data.frame"
Just wrap them in as.data.frame(). Since both data sets have the same by variable you may omit explicit specification.
info_pf <- merge(as.data.frame(tim3), as.data.frame(new_pf))
info_pf
# home_team team_full_name 1yr
# 1 Angels Los Angeles Angels 102
# 2 Astros Houston Astros 99
# 3 Athletics Oakland Athletics 94
# 4 Blue Jays Toronto Blue Jays 106
# 5 Braves Atlanta Braves 105
# 6 Brewers Milwaukee Brewers 102
# 7 Cardinals St. Louis Cardinals 92
# 8 Cubs Chicago Cubs 103
# 9 Diamondbacks Arizona Diamondbacks 103
# 10 Dodgers Los Angeles Dodgers 98
# 11 Giants San Francisco Giants 99
# 12 Guardians Cleveland Guardians 97
# 13 Mariners Seattle Mariners 94
# 14 Marlins Miami Marlins 97
# 15 Mets New York Mets 91
# 16 Nationals Washington Nationals 97
# 17 Orioles Baltimore Orioles 108
# 18 Padres San Diego Padres 96
# 19 Phillies Philadelphia Phillies 98
# 20 Pirates Pittsburgh Pirates 101
# 21 Rangers Texas Rangers 98
# 22 Rays Tampa Bay Rays 89
# 23 Red Sox Boston Red Sox 111
# 24 Reds Cincinnati Reds 112
# 25 Rockies Colorado Rockies 112
# 26 Royals Kansas City Royals 108
# 27 Tigers Detroit Tigers 94
# 28 Twins Minnesota Twins 99
# 29 White Sox Chicago White Sox 100
# 30 Yankees New York Yankees 99
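To see what the conversion does to the class vector, here is a minimal base R sketch. The class name `fancy_data` and the two-row tables are made up for illustration; with baseballr, tibble or data.table loaded, `as.data.frame()` goes through those packages' own methods, so treat this as an approximation of the idea rather than a reproduction of the original objects.

```r
# Two small tables carrying an extra, hypothetical class in front of "data.frame"
tim3 <- data.frame(home_team = c("Brewers", "Cubs"),
                   team_full_name = c("Milwaukee Brewers", "Chicago Cubs"))
new_pf <- data.frame(home_team = c("Cubs", "Brewers"),
                     `1yr` = c(103, 102),
                     check.names = FALSE)  # keep the non-syntactic name "1yr"
class(tim3) <- c("fancy_data", class(tim3))
class(new_pf) <- c("fancy_data", class(new_pf))

# as.data.frame() drops the classes that sit in front of "data.frame"
plain_tim3 <- as.data.frame(tim3)
plain_pf <- as.data.frame(new_pf)
print(class(tim3))
print(class(plain_tim3))

# Merge on the shared key column
info_pf <- merge(plain_tim3, plain_pf, by = "home_team")
print(info_pf)
```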
I want to draw a stratified random sample from panel data. How can I do it?
Example:
The closest example is the Guns dataset in the AER package for R. It covers 51 states and 13 variables over 23 years. I have two questions:
How do I draw a stratified random sample of 40 states?
How do I draw a simple random sample of 40 states?
I tried this:
set.seed(2)
samp1=strata(Guns, ("levels(Guns$state)"), size=c(40), method = "srswor")
but an error is returned:
Error in strata(Guns, (levels(Guns$state)), size = c(40), method = "srswor") :
the names of the strata are wrong
THANKS!
For a simple random sample of 40 states, follow these steps:
library(AER)  # provides the Guns data set
data("Guns")
set.seed(2)
x <- sample(unique(Guns$state), 40)
sample <- Guns[Guns$state %in% x,]
> nrow(Guns)
[1] 1173
> nrow(sample)
[1] 920
So 920 of the 1173 rows are selected. Check the number of states in the sample:
> length(unique(sample$state))
[1] 40
For stratified sampling within this sample of 40 states, say selecting 50% of the rows per state, use this code:
library(tidyverse)
set.seed(2)
str_sample <- sample %>% group_by(state) %>%
sample_frac(size = 0.5)
You will see that 480 rows are selected. Check the size of each stratum:
> table(sample$state)
Alabama Alaska Arizona Arkansas California Colorado Connecticut
23 23 23 0 0 23 0
Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois
23 23 0 23 23 23 23
Indiana Iowa Kansas Kentucky Louisiana Maine Maryland
23 23 23 23 23 23 23
Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska
23 23 23 23 0 23 0
Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota
23 23 0 23 23 23 23
Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota
23 23 23 23 0 23 0
Tennessee Texas Utah Vermont Virginia Washington West Virginia
0 23 23 23 23 23 23
Wisconsin Wyoming
0 23
> table(str_sample$state)
Alabama Alaska Arizona Arkansas California Colorado Connecticut
12 12 12 0 0 12 0
Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois
12 12 0 12 12 12 12
Indiana Iowa Kansas Kentucky Louisiana Maine Maryland
12 12 12 12 12 12 12
Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska
12 12 12 12 0 12 0
Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota
12 12 0 12 12 12 12
Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota
12 12 12 12 0 12 0
Tennessee Texas Utah Vermont Virginia Washington West Virginia
0 12 12 12 12 12 12
Wisconsin Wyoming
0 12
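The two steps above (pick whole states first, then sample within each picked state) can also be sketched in base R on a toy panel. The data set `panel`, the five state labels and the helper names are assumptions made for illustration; they stand in for `Guns` and are not part of the original answer.

```r
set.seed(2)

# Toy panel: 5 states observed over 4 years (hypothetical stand-in for Guns)
panel <- expand.grid(year = 2001:2004,
                     state = c("A", "B", "C", "D", "E"),
                     stringsAsFactors = FALSE)

# Step 1: simple random sample of whole states, keeping all their years
picked <- sample(unique(panel$state), 3)
sub <- panel[panel$state %in% picked, ]

# Step 2: stratified sample, half of the rows within each picked state
rows_by_state <- split(seq_len(nrow(sub)), sub$state)
idx <- unlist(lapply(rows_by_state, function(i) {
  n <- ceiling(length(i) * 0.5)
  i[sample.int(length(i), n)]  # index via sample.int to avoid sample()'s length-1 pitfall
}))
str_sample <- sub[idx, ]

print(table(sub$state))
print(table(str_sample$state))
```

Sampling whole states keeps each state's full time series intact, while the second step thins the rows evenly across the states that were kept.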
I have the following dataset for California housing data:
head(calif_cluster,15)
MedianHouseValue MedianIncome MedianHouseAge TotalRooms TotalBedrooms Population
1 190300 4.20510 16 2697.00 490.00 1462
2 150800 2.54810 33 2821.00 652.00 1206
3 252600 6.08290 17 6213.20 1276.05 3288
4 269700 4.03680 52 919.00 213.00 413
5 91200 1.63680 28 3072.00 790.00 1375
6 66200 2.18980 30 744.00 156.00 410
7 148800 2.63640 39 620.95 136.00 348
8 384800 4.46150 20 2270.00 498.00 1070
9 153200 2.75000 22 1931.00 445.00 1009
10 66200 1.60057 36 973.00 219.00 613
11 461500 3.78130 43 3070.00 668.00 1240
12 144600 2.85000 22 5175.00 1213.00 2804
13 143700 5.09410 8 6213.20 1276.05 3288
14 195500 5.30620 16 2918.00 444.00 1697
15 268800 2.42110 22 620.95 136.00 348
Households Latitude Longitude cluster_kmeans gender_dom marital race edu_level rental
1 515 38.48 -122.47 1 M other black jrcollege rented
2 640 38.00 -122.13 1 F other hispanic doctorate owned
3 1162 33.88 -117.79 3 M other white jrcollege owned
4 193 37.85 -122.25 1 M single others jrcollege owned
5 705 38.13 -122.26 1 F single white doctorate rented
6 165 38.96 -122.21 1 F single others jrcollege owned
7 125 34.01 -118.18 2 M married others postgrad owned
8 521 33.83 -118.38 2 F single white graduate rented
9 407 38.95 -121.04 1 M married others postgrad leased
10 187 35.34 -119.01 2 M single hispanic doctorate owned
11 646 33.76 -118.12 2 F other others highschl leased
12 1091 37.95 -122.05 3 M other white graduate rented
13 1162 36.87 -119.75 3 M other others postgrad leased
14 444 32.93 -117.13 2 M other asian jrcollege owned
15 125 37.71 -120.98 1 F single asian postgrad leased
As I have latitude and longitude information in the dataset, I would like to extract the corresponding county for each coordinate pair using R. Also, is it possible to get the capital city (or largest city) of each extracted county? This would make my stratified analysis more insightful; I intend to do some clustering/mapping exercises.
Take a look at ggmap::revgeocode.
code
library(ggmap)
revgeocode(c(-122.47,38.48)) # longitude then latitude
# [1] "2233 Sulphur Springs Ave, St Helena, CA 94574, USA"
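library(dplyr)
library(tidyr)     # separate() comes from tidyr
library(magrittr)  # %<>% comes from magrittr
# Add the full address using the Google API through ggmap
# (recent ggmap versions need an API key set with register_google())
df12 %<>% rowwise %>% mutate(address = revgeocode(c(Longitude,Latitude))) %>% ungroup
# Split the address on commas. The third piece, named "county" here, actually
# holds the state and ZIP code; the address string contains no county name
df12 %<>% separate(address,c("street_address", "city","county","country"),remove=F,sep=",")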
result
df12 %>% select(Longitude,Latitude,address,county)
# A tibble: 15 x 4
# Longitude Latitude address county
# * <dbl> <dbl> <chr> <chr>
# 1 -122.47 38.48 2233 Sulphur Springs Ave, St Helena, CA 94574, USA CA 94574
# 2 -122.13 38.00 3400-3410 Brookside Dr, Martinez, CA 94553, USA CA 94553
# 3 -117.79 33.88 19721 Bluefield Plaza, Yorba Linda, CA 92886, USA CA 92886
# 4 -122.25 37.85 6365 Florio St, Oakland, CA 94618, USA CA 94618
# 5 -122.26 38.13 119 Mimosa Ct, Vallejo, CA 94589, USA CA 94589
# 6 -122.21 38.96 Unnamed Road, Arbuckle, CA 95912, USA CA 95912
# 7 -118.18 34.01 4360-4414 Noakes St, Los Angeles, CA 90023, USA CA 90023
# 8 -118.38 33.83 903 Serpentine St, Redondo Beach, CA 90277, USA CA 90277
# 9 -121.04 38.95 14666-14690 Musso Rd, Auburn, CA 95603, USA CA 95603
# 10 -119.01 35.34 800 Ming Ave, Bakersfield, CA 93307, USA CA 93307
# 11 -118.12 33.76 6211-6295 E Marina Dr, Long Beach, CA 90803, USA CA 90803
# 12 -122.05 37.95 1120 Carey Dr, Concord, CA 94520, USA CA 94520
# 13 -119.75 36.87 1815-1899 E Pryor Dr, Fresno, CA 93720, USA CA 93720
# 14 -117.13 32.93 9010-9016 Danube Ln, San Diego, CA 92126, USA CA 92126
# 15 -120.98 37.71 748-1298 Claribel Rd, Modesto, CA 95356, USA CA 95356
data
df1 <- read.table(text = "MedianHouseValue MedianIncome MedianHouseAge TotalRooms TotalBedrooms Population
1 190300 4.20510 16 2697.00 490.00 1462
2 150800 2.54810 33 2821.00 652.00 1206
3 252600 6.08290 17 6213.20 1276.05 3288
4 269700 4.03680 52 919.00 213.00 413
5 91200 1.63680 28 3072.00 790.00 1375
6 66200 2.18980 30 744.00 156.00 410
7 148800 2.63640 39 620.95 136.00 348
8 384800 4.46150 20 2270.00 498.00 1070
9 153200 2.75000 22 1931.00 445.00 1009
10 66200 1.60057 36 973.00 219.00 613
11 461500 3.78130 43 3070.00 668.00 1240
12 144600 2.85000 22 5175.00 1213.00 2804
13 143700 5.09410 8 6213.20 1276.05 3288
14 195500 5.30620 16 2918.00 444.00 1697
15 268800 2.42110 22 620.95 136.00 348",header=T,stringsAsFactors=F)
df2 <- read.table(text = "Households Latitude Longitude cluster_kmeans gender_dom marital race edu_level rental
1 515 38.48 -122.47 1 M other black jrcollege rented
2 640 38.00 -122.13 1 F other hispanic doctorate owned
3 1162 33.88 -117.79 3 M other white jrcollege owned
4 193 37.85 -122.25 1 M single others jrcollege owned
5 705 38.13 -122.26 1 F single white doctorate rented
6 165 38.96 -122.21 1 F single others jrcollege owned
7 125 34.01 -118.18 2 M married others postgrad owned
8 521 33.83 -118.38 2 F single white graduate rented
9 407 38.95 -121.04 1 M married others postgrad leased
10 187 35.34 -119.01 2 M single hispanic doctorate owned
11 646 33.76 -118.12 2 F other others highschl leased
12 1091 37.95 -122.05 3 M other white graduate rented
13 1162 36.87 -119.75 3 M other others postgrad leased
14 444 32.93 -117.13 2 M other asian jrcollege owned
15 125 37.71 -120.98 1 F single asian postgrad leased",header=T,stringsAsFactors=F)
df12 <- cbind(df1,df2)
I don't think the library offers an option to get the capital or largest city of a county, but you should not have much trouble building a lookup table from online information.
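Such a lookup table could be joined on with `merge()`. The sketch below assumes you already have a `county` column; the table `seat_lookup`, the `id` column and the three rows are hypothetical and only show the join.

```r
# Rows that already carry a county name (illustrative values)
homes <- data.frame(id = 1:3,
                    county = c("Fresno County", "Los Angeles County", "Fresno County"),
                    stringsAsFactors = FALSE)

# Hand-built lookup table: one row per county
seat_lookup <- data.frame(county = c("Fresno County", "Los Angeles County"),
                          county_seat = c("Fresno", "Los Angeles"),
                          stringsAsFactors = FALSE)

# Left join: keep every row of homes, add the seat where the county is known
out <- merge(homes, seat_lookup, by = "county", all.x = TRUE)
out <- out[order(out$id), ]  # merge() sorts by the key, so restore the row order
print(out)
```

Using `all.x = TRUE` means rows whose county is missing from the lookup are kept with `NA` instead of being dropped.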
I have a weird data frame where the Player column has the names of the players. The problem is that the first name is shown twice. So Roy Sievers is RoyRoy Sievers, and I want the name to obviously be Roy Sievers.
Would anybody know how to do this?
Here is the full data frame, it's not very long:
Year Player Team Position
1 1949 RoyRoy Sievers St. Louis Browns OF
2 1950 WaltWalt Dropo Boston Red Sox 1B
3 1951 GilGil McDougald New York Yankees 3B
4 1952 HarryHarry Byrd Philadelphia Athletics P
5 1953 HarveyHarvey Kuenn Detroit Tigers SS
6 1954 BobBob Grim New York Yankees P
7 1955 HerbHerb Score Cleveland Indians P
8 1956 LuisLuis Aparicio Chicago White Sox SS
9 1957 TonyTony Kubek New York Yankees SS
10 1958 AlbieAlbie Pearson Washington Senators OF
11 1959 BobBob Allison Washington Senators OF
12 1960 RonRon Hansen Baltimore Orioles SS
13 1961 DonDon Schwall Boston Red Sox P
14 1962 TomTom Tresh New York Yankees SS
15 1963 GaryGary Peters Chicago White Sox P
16 1964 TonyTony Oliva Minnesota Twins OF
17 1965 CurtCurt Blefary Baltimore Orioles OF
18 1966 TommieTommie Agee Chicago White Sox OF
19 1967 RodRod Carew Minnesota Twins 2B
20 1968 StanStan Bahnsen New York Yankees P
21 1969 LouLou Piniella Kansas City Royals OF
22 1970 ThurmanThurman Munson New York Yankees C
23 1971 ChrisChris Chambliss Cleveland Indians 1B
24 1972 CarltonCarlton Fisk Boston Red Sox C
25 1973 AlAl Bumbry Baltimore Orioles OF
26 1974 MikeMike Hargrove Texas Rangers 1B
27 1975 FredFred Lynn Boston Red Sox OF
28 1976 MarkMark Fidrych Detroit Tigers P
29 1977 EddieEddie Murray Baltimore Orioles DH
30 1978 LouLou Whitaker Detroit Tigers 2B
31 1979* JohnJohn Castino Minnesota Twins 3B
32 1979* AlfredoAlfredo Griffin Toronto Blue Jays SS
33 1980 JoeJoe Charboneau Cleveland Indians OF
34 1981 DaveDave Righetti New York Yankees P
35 1982 CalCal Ripken Baltimore Orioles SS
36 1983 RonRon Kittle Chicago White Sox OF
37 1984 AlvinAlvin Davis Seattle Mariners 1B
38 1985 OzzieOzzie Guillén Chicago White Sox SS
39 1986 JoseJose Canseco Oakland Athletics OF
40 1987 MarkMark McGwire Oakland Athletics 1B
41 1988 WaltWalt Weiss Oakland Athletics SS
42 1989 GreggGregg Olson Baltimore Orioles P
43 1990 Sandy Alomar Jr Cleveland Indians C
44 1991 ChuckChuck Knoblauch Minnesota Twins 2B
45 1992 PatPat Listach Milwaukee Brewers SS
46 1993 TimTim Salmon California Angels OF
47 1994 BobBob Hamelin Kansas City Royals DH
48 1995 MartyMarty Cordova Minnesota Twins OF
49 1996 DerekDerek Jeter New York Yankees SS
50 1997 NomarNomar Garciaparra Boston Red Sox SS
51 1998 BenBen Grieve Oakland Athletics OF
52 1999 CarlosCarlos Beltrán Kansas City Royals OF
53 2000 KazuhiroKazuhiro Sasaki Seattle Mariners P
54 2001 IchiroIchiro Suzuki Seattle Mariners OF
55 2002 EricEric Hinske Toronto Blue Jays 3B
56 2003 ÁngelÁngel Berroa Kansas City Royals SS
57 2004 BobbyBobby Crosby Oakland Athletics SS
58 2005 HustonHuston Street Oakland Athletics P
59 2006 JustinJustin Verlander Detroit Tigers P
60 2007 DustinDustin Pedroia Boston Red Sox 2B
61 2008 EvanEvan Longoria Tampa Bay Rays 3B
62 2009 Andrew Bailey Oakland Athletics P
63 2010 NeftalíNeftalí Feliz Texas Rangers P
64 2011 JeremyJeremy Hellickson Tampa Bay Rays P
65 2012 MikeMike Trout Los Angeles Angels OF
66 2013 WilWil Myers Tampa Bay Rays OF
67 2014 JoséJosé Abreu Chicago White Sox 1B
68 2015 CarlosCarlos Correa Houston Astros SS
69 2016 MichaelMichael Fulmer Detroit Tigers P
You can fix this by finding a repeated pattern of at least three letters and replacing it with one copy like this:
gsub("(\\w{3,})\\1", "\\1", Players$Player)
If you want to overwrite the old version, assign the result back:
Players$Player = gsub("(\\w{3,})\\1", "\\1", Players$Player)
G5W's answer gets you most of the way there, but would miss two-letter first names like "Al". This version relies on capitalization, and not character count:
myData$Player <- gsub('([A-Z][a-z]+)\\1', '\\1', myData$Player)
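One caveat: `[A-Z]` and `[a-z]` are ASCII ranges, so accented first names in the data above, such as "ÁngelÁngel" or "JoséJosé", would not be matched by that pattern. A possible variant, offered as a sketch rather than a tested fix for the full data frame, uses Unicode property classes with `perl = TRUE`. The vector `players` below is a small hand-picked sample, not the original data frame.

```r
# A few names from the table, with accents written as \u escapes so the
# strings are UTF-8 regardless of the file encoding
players <- c("RoyRoy Sievers", "AlAl Bumbry", "Jos\u00e9Jos\u00e9 Abreu",
             "\u00c1ngel\u00c1ngel Berroa", "Sandy Alomar Jr", "Andrew Bailey")

# \p{Lu} = any uppercase letter, \p{L} = any letter; \1 = the same text again
fixed <- gsub("(\\p{Lu}\\p{L}+)\\1", "\\1", players, perl = TRUE)
print(fixed)
```

Names that were never doubled, such as "Sandy Alomar Jr", pass through unchanged because the pattern only matches an immediately repeated capitalized token.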
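For the not-so-regex-savvy:
library(stringr)
fun1 <- function(string) {
  # Split each name into its first token and the rest
  parts <- str_split_fixed(string, " ", 2)
  first <- parts[, 1]
  # Take the first half of the first token
  half <- str_sub(first, 1, str_length(first) %/% 2)
  # Only shorten tokens that really are the same text twice
  doubled <- first == str_c(half, half)
  str_c(ifelse(doubled, half, first), " ", parts[, 2])
}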
fun1(df$Player)
Here is my data
x i
1 D W MCMILLAN MEMORIAL HOSPITAL AL
2 <NA> AK
3 JOHN C LINCOLN DEER VALLEY HOSPITAL AZ
4 ARKANSAS METHODIST MEDICAL CENTER AR
5 SHERMAN OAKS HOSPITAL CA
6 SKY RIDGE MEDICAL CENTER CO
7 MIDSTATE MEDICAL CENTER CT
8 <NA> DE
9 <NA> DC
10 SOUTH FLORIDA BAPTIST HOSPITAL FL
11 UPSON REGIONAL MEDICAL CENTER GA
12 <NA> HI
13 LOST RIVERS DISTRICT HOSPITAL ID
14 JESSE BROWN VA MEDICAL CENTER - VA CHICAGO HEALTHCARE SYSTEM IL
15 COMMUNITY HOSPITAL IN
16 COVENANT MEDICAL CENTER IA
17 COFFEYVILLE REGIONAL MEDICAL CENTER KS
18 KING'S DAUGHTERS' MEDICAL CENTER KY
19 NORTH OAKS MEDICAL CENTER, LLC LA
20 RUMFORD HOSPITAL ME
21 CIVISTA MEDICAL CENTER MD
22 HEYWOOD HOSPITAL MA
23 GENESYS REGIONAL MEDICAL CENTER - HEALTH PARK MI
24 HEALTHEAST WOODWINDS HOSPITAL MN
25 MARION GENERAL HOSPITAL MS
26 LIBERTY HOSPITAL MO
27 FRANCES MAHON DEACONESS HOSPITAL MT
28 ALEGENT HEALTH MEMORIAL HOSPITAL NE
29 BANNER CHURCHILL COMMUNITY HOSPITAL NV
30 FRANKLIN REGIONAL HOSPITAL NH
31 CAPITAL HEALTH MEDICAL CENTER - HOPEWELL NJ
32 ESPANOLA HOSPITAL NM
33 METROPOLITAN HOSPITAL CENTER NY
34 MEDWEST HAYWOOD NC
35 LISBON AREA HEALTH SERVICES ND
36 CINCINNATI VA MEDICAL CENTER OH
37 JACKSON COUNTY MEMORIAL HOSPITAL OK
38 ST ALPHONSUS MEDICAL CENTER - BAKER CITY, INC OR
39 UPMC PASSAVANT PA
40 HOSPITAL METROPOLITANO DR TITO MATTEI PR
41 <NA> RI
42 PALMETTO HEALTH BAPTIST SC
43 BLACK HILLS SURGICAL HOSPITAL LLP SD
44 INDIAN PATH MEDICAL CENTER TN
45 NIX HEALTH CARE SYSTEM TX
46 BEAR RIVER VALLEY HOSPITAL UT
47 <NA> VT
48 <NA> VI
49 CARILION GILES COMMUNITY HOSPITAL VA
50 SWEDISH MEDICAL CENTER WA
51 PLATEAU MEDICAL CENTER WV
52 ST CROIX REG MED CTR WI
53 POWELL VALLEY HOSPITAL WY
54 <NA> GU
I want to order this list by column i, but for some reason it throws GU at the bottom.
When I run
order(z$i)
(z is my table)
I get this as a result
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
> str(z)
'data.frame': 54 obs. of 2 variables:
$ x: Factor w/ 46 levels "D W MCMILLAN MEMORIAL HOSPITAL",..: 1 NA 2 3 4 5 6 NA NA 7 ...
$ i: Factor w/ 54 levels "AL","AK","AZ",..: 1 2 3 4 5 6 7 8 9 10 ...
Which to me means that it thinks that GU belongs at the bottom of the list. Also there is a problem at the top of the list, AL is before AK and AZ is before AR.
Any suggestion why it would do this?
Thanks
z[order(as.character(z$i)), ]
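will do the trick. z$i is a factor, and order() sorts a factor by its level order rather than alphabetically. Here the levels are in order of first appearance ("AL", "AK", "AZ", ...), so converting to character first gives the alphabetical order.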
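A small sketch of the difference, using a made-up four-row factor whose levels are set in order of appearance, as in the `str(z)` output above:

```r
# Levels follow the order of appearance, not the alphabet
z <- data.frame(i = factor(c("AL", "AK", "AZ", "GU"),
                           levels = c("AL", "AK", "AZ", "GU")))

by_level <- order(z$i)               # sorts by level position
by_name <- order(as.character(z$i))  # sorts by the text itself

print(by_level)
print(by_name)
print(z[by_name, , drop = FALSE])
```

If the alphabetical order should stick, another option is to rebuild the factor with sorted levels, for example `factor(z$i, levels = sort(levels(z$i)))`.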