Using rank() with subset [closed] - r

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 8 years ago.
I have this data:
State Abb Region Change
3 Arizona AZ West 24.6
6 Colorado CO West 16.9
10 Florida FL South 17.6
11 Georgia GA South 18.3
13 Idaho ID West 21.1
29 Nevada NV West 35.1
34 North Carolina NC South 18.5
41 South Carolina SC South 15.3
44 Texas TX South 20.6
45 Utah UT West 23.8
I'm trying to extract a subset where Change > 40.
When I use
subset(uspopchange, rank(Change)>40)
it works
but when I use
subset(uspopchange, Change > 40)
it comes up with nothing.
Furthermore, if I use
subset(uspopchange, Change > 16.9)
it works also.
Why does it do that? Why do I need to use rank() to get my subset?
BTW: the data is from
install.packages("gcookbook")

> library(gcookbook)
> data(uspopchange)
> head(uspopchange[order(uspopchange$Change,decreasing=TRUE),])
State Abb Region Change
29 Nevada NV West 35.1
3 Arizona AZ West 24.6
45 Utah UT West 23.8
13 Idaho ID West 21.1
44 Texas TX South 20.6
34 North Carolina NC South 18.5
There are no rows with Change greater than 40. When you are using rank(Change) > 40 in your subset(), it is giving you the rows that, based on the value of Change, have a rank higher than 40. Since there are 50 rows in your data (Change has a length of 50), you are getting the rows that rank 41, 42, 43, ... , 50.
> Top10 <- subset(uspopchange, rank(Change)>40)
> Top10[order(Top10$Change,decreasing=TRUE),]
State Abb Region Change
29 Nevada NV West 35.1
3 Arizona AZ West 24.6
45 Utah UT West 23.8
13 Idaho ID West 21.1
44 Texas TX South 20.6
34 North Carolina NC South 18.5
11 Georgia GA South 18.3
10 Florida FL South 17.6
6 Colorado CO West 16.9
41 South Carolina SC South 15.3
##
> uspopchange[order(uspopchange$Change,decreasing=TRUE),][1:10,]
State Abb Region Change
29 Nevada NV West 35.1
3 Arizona AZ West 24.6
45 Utah UT West 23.8
13 Idaho ID West 21.1
44 Texas TX South 20.6
34 North Carolina NC South 18.5
11 Georgia GA South 18.3
10 Florida FL South 17.6
6 Colorado CO West 16.9
41 South Carolina SC South 15.3
Those are equivalent.
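To make the difference between filtering on the value and filtering on the rank concrete, here is a minimal sketch. It assumes a small stand-in data frame (called `top10` here, a made-up name) built from the ten rows shown in the question rather than the full 50-row `uspopchange`:

```r
# Stand-in data: the ten rows printed in the question (not the full dataset).
top10 <- data.frame(
  State  = c("Arizona", "Colorado", "Florida", "Georgia", "Idaho", "Nevada",
             "North Carolina", "South Carolina", "Texas", "Utah"),
  Change = c(24.6, 16.9, 17.6, 18.3, 21.1, 35.1, 18.5, 15.3, 20.6, 23.8)
)

# Filtering on the value: no Change exceeds 40, so this has zero rows.
by_value <- subset(top10, Change > 40)

# Filtering on the rank: rank() numbers the values 1..10 from smallest to
# largest, so rank(Change) > 7 keeps the three largest values.
by_rank <- subset(top10, rank(Change) > 7)

nrow(by_value)
by_rank$State
```

With 50 rows, `rank(Change) > 40` plays the same role as `rank(Change) > 7` does here: it keeps the ten largest values, whatever they are.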


How to make a stratified random sample with panel data in Rstudio?

I want to do a stratified random sample of panel data. How to do it?
Example:
The closest example is the Guns dataset, included in the AER package for R. It has 51 states and 13 variables over 23 years. There are two cases:
How do I draw a stratified random sample of 40 states?
How do I draw a simple random sample of 40 states?
I tried with this:
set.seed(2)
samp1=strata(Guns, ("levels(Guns$state)"), size=c(40), method = "srswor")
but an error is returned:
Error in strata(Guns, (levels(Guns$state)), size = c(40), method = "srswor") :
the names of the strata are wrong
THANKS!
For a simple random sample of states, follow these steps:
set.seed(2)
x <- sample(unique(Guns$state), 40)
sample <- Guns[Guns$state %in% x,]
> nrow(Guns)
[1] 1173
> nrow(sample)
[1] 920
920 of 1173 rows are selected.
Check the number of states in the sample:
> length(unique(sample$state))
[1] 40
For stratified sampling within this sample of 40 states, with, say, 50% of the rows selected per state, use this code:
library(tidyverse)
set.seed(2)
str_sample <- sample %>% group_by(state) %>%
sample_frac(size = 0.5)
You'll see that 480 rows are selected. Check each stratum size:
> table(sample$state)
Alabama Alaska Arizona Arkansas California Colorado Connecticut
23 23 23 0 0 23 0
Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois
23 23 0 23 23 23 23
Indiana Iowa Kansas Kentucky Louisiana Maine Maryland
23 23 23 23 23 23 23
Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska
23 23 23 23 0 23 0
Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota
23 23 0 23 23 23 23
Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota
23 23 23 23 0 23 0
Tennessee Texas Utah Vermont Virginia Washington West Virginia
0 23 23 23 23 23 23
Wisconsin Wyoming
0 23
> table(str_sample$state)
Alabama Alaska Arizona Arkansas California Colorado Connecticut
12 12 12 0 0 12 0
Delaware District of Columbia Florida Georgia Hawaii Idaho Illinois
12 12 0 12 12 12 12
Indiana Iowa Kansas Kentucky Louisiana Maine Maryland
12 12 12 12 12 12 12
Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska
12 12 12 12 0 12 0
Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota
12 12 0 12 12 12 12
Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota
12 12 12 12 0 12 0
Tennessee Texas Utah Vermont Virginia Washington West Virginia
0 12 12 12 12 12 12
Wisconsin Wyoming
0 12
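The same two steps can also be sketched in base R without any packages. The example below uses a made-up toy panel (5 units over 4 years, with hypothetical names `panel`, `unit` and `value`), not the Guns data, so the numbers are only illustrative:

```r
# Toy panel: 5 hypothetical units observed over 4 years (not the Guns data).
set.seed(2)
panel <- expand.grid(year = 2001:2004,
                     unit = c("A", "B", "C", "D", "E"),
                     stringsAsFactors = FALSE)
panel$value <- rnorm(nrow(panel))

# Step 1: simple random sample of 3 whole units, keeping all of their years.
picked      <- sample(unique(panel$unit), 3)
unit_sample <- panel[panel$unit %in% picked, ]

# Step 2: within each picked unit (stratum), draw half of its rows.
by_unit <- split(unit_sample, unit_sample$unit)
half    <- do.call(rbind, lapply(by_unit, function(d) {
  d[sample(nrow(d), nrow(d) %/% 2), ]
}))

table(half$unit)
```

Sampling whole units first keeps each unit's time series intact; the second step then thins each stratum by the same fraction.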

Extract out home team and away teams separated by "at" in R

I have a vector of Matchups in college basketball:
c("#34 Colorado at #36 California", "#31 Utah at #87 Stanford",
"#26 USC at #112 Wash State", "#56 UCLA at #134 Washington",
"#187 W Illinois at #116 Neb Omaha", "#222 Denver at #58 S Dakota St",
"#245 IUPUI at #170 South Dakota", "#268 Rice at #208 TX El Paso",
"#274 North Texas at #344 TX-San Ant", "#14 Iowa at #3 Purdue"
)
I'd like two separate vectors: one for the teams before "at" and the other for the teams after "at". For example, the first vector would have Colorado, Utah, USC, etc., and the second vector would have California, Stanford, Wash State, etc.
Notice that I don't want the # rankings, just the team names. I've tried str_split, but it doesn't work well because the spacing is inconsistent.
We can use strsplit and split on "at", which gives two parts for each string. From each part we remove "#" followed by a number, and then put the result in a data frame.
data.frame(t(sapply(strsplit(string, "\\bat\\b"),
function(x) trimws(sub("#[0-9]+", "", x)))))
# X1 X2
#1 Colorado California
#2 Utah Stanford
#3 USC Wash State
#4 UCLA Washington
#5 W Illinois Neb Omaha
#6 Denver S Dakota St
#7 IUPUI South Dakota
#8 Rice TX El Paso
#9 North Texas TX-San Ant
#10 Iowa Purdue
Or using tidyr::separate
tidyr::separate(data.frame(col = trimws(gsub("#[0-9]+", "", string))),
col, into = c("T1", "T2"), sep = "\\bat\\b")
# T1 T2
#1 Colorado California
#2 Utah Stanford
#3 USC Wash State
#4 UCLA Washington
#5 W Illinois Neb Omaha
#6 Denver S Dakota St
#7 IUPUI South Dakota
#8 Rice TX El Paso
#9 North Texas TX-San Ant
#10 Iowa Purdue
Another solution with str_extract_all()
df <- data.frame(stringsAsFactors = FALSE,
text = c("#34 Colorado at #36 California", "#31 Utah at #87 Stanford",
"#26 USC at #112 Wash State", "#56 UCLA at #134 Washington",
"#187 W Illinois at #116 Neb Omaha", "#222 Denver at #58 S Dakota St",
"#245 IUPUI at #170 South Dakota", "#268 Rice at #208 TX El Paso",
"#274 North Texas at #344 TX-San Ant", "#14 Iowa at #3 Purdue")
)
library(stringr)
library(dplyr)
df %>%
mutate(team_a = str_extract_all(text, "(?<=\\s).+(?=\\s+at)"),
team_b = str_extract_all(text, "(?<=\\d\\s)[^\\d]+$"))
#> text team_a team_b
#> 1 #34 Colorado at #36 California Colorado California
#> 2 #31 Utah at #87 Stanford Utah Stanford
#> 3 #26 USC at #112 Wash State USC Wash State
#> 4 #56 UCLA at #134 Washington UCLA Washington
#> 5 #187 W Illinois at #116 Neb Omaha W Illinois Neb Omaha
#> 6 #222 Denver at #58 S Dakota St Denver S Dakota St
#> 7 #245 IUPUI at #170 South Dakota IUPUI South Dakota
#> 8 #268 Rice at #208 TX El Paso Rice TX El Paso
#> 9 #274 North Texas at #344 TX-San Ant North Texas TX-San Ant
#> 10 #14 Iowa at #3 Purdue Iowa Purdue
Created on 2019-03-29 by the reprex package (v0.2.1)
We can do this in base R by replacing "at" with a comma, removing the rankings from the 'text' column, and reading the result with read.csv
read.csv(text = trimws(gsub("#\\d+", "", gsub("\\s+at\\s+", ",", df$text))),
header = FALSE, col.names = c("T1", "T2"), stringsAsFactors = FALSE)
# T1 T2
#1 Colorado California
#2 Utah Stanford
#3 USC Wash State
#4 UCLA Washington
#5 W Illinois Neb Omaha
#6 Denver S Dakota St
#7 IUPUI South Dakota
#8 Rice TX El Paso
#9 North Texas TX-San Ant
#10 Iowa Purdue
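The answers above return data frames, while the question asks for two separate vectors. A minimal base R sketch of that variant follows; the names `matchups`, `before_at` and `after_at` are made up for illustration, and only three of the strings are used to keep it short:

```r
# Three of the matchup strings from the question.
matchups <- c("#34 Colorado at #36 California",
              "#187 W Illinois at #116 Neb Omaha",
              "#14 Iowa at #3 Purdue")

# Drop each "#<number>" ranking together with the spaces that follow it.
cleaned <- gsub("#[0-9]+\\s*", "", matchups)

# Split on the word "at" surrounded by whitespace.
parts <- strsplit(cleaned, "\\s+at\\s+")

# Pull the first and second piece of every split into plain character vectors.
before_at <- vapply(parts, `[`, character(1), 1)
after_at  <- vapply(parts, `[`, character(1), 2)

before_at
after_at
```

Splitting on whitespace-delimited "at" assumes no team name itself contains a standalone "at", which holds for the strings in the question.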

Arrange function in plyr

While doing the final assignment of the Coursera R Programming course, my arrange call didn't sort correctly and I am fairly confused.
I extracted the data for each state with the split function:
# Columns 2, 7 and 23 are selected, so exactly three names are needed
my_data <- outcome[, c(2, 7, 23)]
names(my_data) <- c("hospital", "state", "pneumonia")
my_data1 <- lapply(split(my_data, my_data$state), data.frame)
my_data2 <- arrange(my_data1$stateName, pneumonia)
> arrange(my_data1$AK, pneumonia)
hospital state pneumonia
1 PROVIDENCE ALASKA MEDICAL CENTER AK 10.5
2 PEACEHEALTH KETCHIKAN MEDICAL CENTER AK 11.3
3 SITKA COMMUNITY HOSPITAL AK 11.5
4 BARTLETT REGIONAL HOSPITAL AK 11.6
5 NORTON SOUND REGIONAL HOSPITAL AK 11.6
6 PROVIDENCE KODIAK ISLAND MEDICAL CTR AK 12.0
7 MAT-SU REGIONAL MEDICAL CENTER AK 12.1
8 SOUTH PENINSULA HOSPITAL AK 12.2
9 ALASKA REGIONAL HOSPITAL AK 12.5
10 FAIRBANKS MEMORIAL HOSPITAL AK 13.4
11 CENTRAL PENINSULA GENERAL HOSPITAL AK 13.8
12 MT EDGECUMBE HOSPITAL AK 14.2
13 ALASKA NATIVE MEDICAL CENTER AK 15.5
14 YUKON KUSKOKWIM DELTA REG HOSPITAL AK 9.7
which obviously
14 YUKON KUSKOKWIM DELTA REG HOSPITAL AK 9.7
should be at the top.
The result was the same after I tried to change the column class to numeric.
So I converted the pneumonia column with
> class(my_data2$pneumonia)
[1] "character"
> my_data2$pneumonia <-as.numeric(my_data2$pneumonia)
Warning message:
NAs introduced by coercion
> class(my_data2$pneumonia)
[1] "numeric"
> arrange(my_data2,pneumonia)
hospital state pneumonia
1 YUKON KUSKOKWIM DELTA REG HOSPITAL AK 9.7
2 PROVIDENCE ALASKA MEDICAL CENTER AK 10.5
3 PEACEHEALTH KETCHIKAN MEDICAL CENTER AK 11.3
4 SITKA COMMUNITY HOSPITAL AK 11.5
5 BARTLETT REGIONAL HOSPITAL AK 11.6
6 NORTON SOUND REGIONAL HOSPITAL AK 11.6
7 PROVIDENCE KODIAK ISLAND MEDICAL CTR AK 12.0
8 MAT-SU REGIONAL MEDICAL CENTER AK 12.1
9 SOUTH PENINSULA HOSPITAL AK 12.2
10 ALASKA REGIONAL HOSPITAL AK 12.5
11 FAIRBANKS MEMORIAL HOSPITAL AK 13.4
12 CENTRAL PENINSULA GENERAL HOSPITAL AK 13.8
13 MT EDGECUMBE HOSPITAL AK 14.2
14 ALASKA NATIVE MEDICAL CENTER AK 15.5
15 PROVIDENCE VALDEZ MEDICAL CENTER AK NA
16 PROVIDENCE SEWARD HOSPITAL AK NA
17 CORDOVA COMMUNITY MEDICAL CENTER AK NA
and the problem is solved. Thanks for the comment.
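The reason the conversion fixes the order is that character values are sorted as text, character by character, so "9.7" lands after "15.5" because "9" comes after "1". A minimal sketch with a made-up vector (`rates` is hypothetical, and "Not Available" is only an assumed example of a non-numeric entry that would trigger the coercion warning):

```r
# Hypothetical rates stored as text, including one non-numeric entry.
rates <- c("10.5", "11.3", "15.5", "9.7", "Not Available")

# Text sort: compares character by character, so "9.7" is not first.
text_order <- sort(rates[1:4])

# Numeric sort: the non-numeric entry becomes NA (with a coercion warning,
# suppressed here), and sort() drops NA by default.
num_rates <- suppressWarnings(as.numeric(rates))
num_order <- sort(num_rates)

text_order
num_order
```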

removing duplicate/repeating values in the same data frame column in R

I have a weird data frame where the Player column has the names of the players. The problem is that the first name is shown twice. So Roy Sievers is RoyRoy Sievers, and I want the name to obviously be Roy Sievers.
Would anybody know how to do this?
Here is the full data frame, it's not very long:
Year Player Team Position
1 1949 RoyRoy Sievers St. Louis Browns OF
2 1950 WaltWalt Dropo Boston Red Sox 1B
3 1951 GilGil McDougald New York Yankees 3B
4 1952 HarryHarry Byrd Philadelphia Athletics P
5 1953 HarveyHarvey Kuenn Detroit Tigers SS
6 1954 BobBob Grim New York Yankees P
7 1955 HerbHerb Score Cleveland Indians P
8 1956 LuisLuis Aparicio Chicago White Sox SS
9 1957 TonyTony Kubek New York Yankees SS
10 1958 AlbieAlbie Pearson Washington Senators OF
11 1959 BobBob Allison Washington Senators OF
12 1960 RonRon Hansen Baltimore Orioles SS
13 1961 DonDon Schwall Boston Red Sox P
14 1962 TomTom Tresh New York Yankees SS
15 1963 GaryGary Peters Chicago White Sox P
16 1964 TonyTony Oliva Minnesota Twins OF
17 1965 CurtCurt Blefary Baltimore Orioles OF
18 1966 TommieTommie Agee Chicago White Sox OF
19 1967 RodRod Carew Minnesota Twins 2B
20 1968 StanStan Bahnsen New York Yankees P
21 1969 LouLou Piniella Kansas City Royals OF
22 1970 ThurmanThurman Munson New York Yankees C
23 1971 ChrisChris Chambliss Cleveland Indians 1B
24 1972 CarltonCarlton Fisk Boston Red Sox C
25 1973 AlAl Bumbry Baltimore Orioles OF
26 1974 MikeMike Hargrove Texas Rangers 1B
27 1975 FredFred Lynn Boston Red Sox OF
28 1976 MarkMark Fidrych Detroit Tigers P
29 1977 EddieEddie Murray Baltimore Orioles DH
30 1978 LouLou Whitaker Detroit Tigers 2B
31 1979* JohnJohn Castino Minnesota Twins 3B
32 1979* AlfredoAlfredo Griffin Toronto Blue Jays SS
33 1980 JoeJoe Charboneau Cleveland Indians OF
34 1981 DaveDave Righetti New York Yankees P
35 1982 CalCal Ripken Baltimore Orioles SS
36 1983 RonRon Kittle Chicago White Sox OF
37 1984 AlvinAlvin Davis Seattle Mariners 1B
38 1985 OzzieOzzie Guillén Chicago White Sox SS
39 1986 JoseJose Canseco Oakland Athletics OF
40 1987 MarkMark McGwire Oakland Athletics 1B
41 1988 WaltWalt Weiss Oakland Athletics SS
42 1989 GreggGregg Olson Baltimore Orioles P
43 1990 Sandy Alomar Jr Cleveland Indians C
44 1991 ChuckChuck Knoblauch Minnesota Twins 2B
45 1992 PatPat Listach Milwaukee Brewers SS
46 1993 TimTim Salmon California Angels OF
47 1994 BobBob Hamelin Kansas City Royals DH
48 1995 MartyMarty Cordova Minnesota Twins OF
49 1996 DerekDerek Jeter New York Yankees SS
50 1997 NomarNomar Garciaparra Boston Red Sox SS
51 1998 BenBen Grieve Oakland Athletics OF
52 1999 CarlosCarlos Beltrán Kansas City Royals OF
53 2000 KazuhiroKazuhiro Sasaki Seattle Mariners P
54 2001 IchiroIchiro Suzuki Seattle Mariners OF
55 2002 EricEric Hinske Toronto Blue Jays 3B
56 2003 ÁngelÁngel Berroa Kansas City Royals SS
57 2004 BobbyBobby Crosby Oakland Athletics SS
58 2005 HustonHuston Street Oakland Athletics P
59 2006 JustinJustin Verlander Detroit Tigers P
60 2007 DustinDustin Pedroia Boston Red Sox 2B
61 2008 EvanEvan Longoria Tampa Bay Rays 3B
62 2009 Andrew Bailey Oakland Athletics P
63 2010 NeftalíNeftalí Feliz Texas Rangers P
64 2011 JeremyJeremy Hellickson Tampa Bay Rays P
65 2012 MikeMike Trout Los Angeles Angels OF
66 2013 WilWil Myers Tampa Bay Rays OF
67 2014 JoséJosé Abreu Chicago White Sox 1B
68 2015 CarlosCarlos Correa Houston Astros SS
69 2016 MichaelMichael Fulmer Detroit Tigers P
You can fix this by finding a repeated pattern of at least three letters and replacing it with one copy like this:
gsub("(\\w{3,})\\1", "\\1", Players$Player)
If you want to overwrite the old version, assign the result back:
Players$Player = gsub("(\\w{3,})\\1", "\\1", Players$Player)
G5W's answer gets you most of the way there, but would miss two-letter first names like "Al". This version relies on capitalization, and not character count:
myData$Player <- gsub('([A-Z][a-z]+)\\1', '\\1', myData$Player)
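One caveat with a pattern built on [A-Z] and [a-z] is that accented first names in the data (for example José or Ángel) may not match. A possible variant, sketched here under the assumption that the strings are UTF-8 and that R's PCRE build supports Unicode properties, uses `\p{Lu}` (uppercase letter) and `\p{Ll}` (lowercase letter) with `perl = TRUE`. The vector `players` is a made-up sample, not the full data frame:

```r
# Small made-up sample covering short, accented and non-doubled names.
players <- c("RoyRoy Sievers", "AlAl Bumbry",
             "Jos\u00e9Jos\u00e9 Abreu", "Sandy Alomar Jr")

# One uppercase letter followed by lowercase letters, immediately repeated:
# keep a single copy. Names that are not doubled are left untouched.
fixed <- gsub("(\\p{Lu}\\p{Ll}+)\\1", "\\1", players, perl = TRUE)

fixed
```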
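For the not-so-regex-savvy:
library(stringr)
# Halve the doubled first name and paste the rest of the name back on.
# Assumes every first name is doubled; a name that is not doubled
# (e.g. "Andrew Bailey") would be cut in half as well.
fun1 <- function(string) {
  parts <- str_split_fixed(string, " ", 2)  # column 1: first name, column 2: the rest
  first <- str_sub(parts[, 1], start = 1, end = str_length(parts[, 1]) / 2)
  paste(first, parts[, 2])
}
fun1(df$Player)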

Can't remove a row from a matrix in R

I'm trying to remove an outlier from a data matrix. The original matrix is called Westdata and I want to remove row 51.
I've tried the following line of code but it doesn't remove the outlier and the new matrix is identical to the old one.
Westdata.Outlier<-Westdata[-51,]
Westdata.Outlier
State Region Pay Spend Area
20 Mont. MN 22.5 3.95 West
21 Wyo. MN 27.2 5.44 West
22 N.Mex. MN 22.6 3.40 West
23 Utah MN 22.3 2.30 West
24 Wash. PA 26.0 3.71 West
25 Calif. PA 29.1 3.61 West
26 Hawaii PA 25.8 3.77 West
46 Idaho MN 21.0 2.51 West
47 Colo. MN 25.9 4.04 West
48 Ariz. MN 26.6 2.83 West
49 Nev. MN 25.6 2.93 West
50 Oreg. PA 25.8 4.12 West
51 Alaska PA 41.5 8.35 West
Any suggestions?
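One likely explanation, judging only from the printed output: Westdata has 13 rows, and "51" is a row name carried over from a larger data set, not a row position. A negative index such as -51 refers to the 51st position, which does not exist here, so nothing is dropped. A minimal sketch with a made-up stand-in (`west` is hypothetical and has only four rows):

```r
# Hypothetical stand-in for Westdata: a subset that kept its original row names.
west <- data.frame(State = c("Mont.", "Wyo.", "Oreg.", "Alaska"),
                   Pay   = c(22.5, 27.2, 25.8, 41.5),
                   row.names = c("20", "21", "50", "51"))

# Positional removal: there is no 51st row, so all 4 rows remain.
unchanged <- west[-51, ]

# Removal by row name: drops the row labelled "51".
no_outlier <- west[rownames(west) != "51", ]

nrow(unchanged)
no_outlier
```

If the object is a true matrix rather than a data frame, the same `rownames()` comparison should work as long as the matrix has row names.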
