I'm currently learning R with the help of videos on Coursera. I am trying to exclude from a table all hospitals in states that have fewer than 20 hospitals, but I couldn't find a correct solution because of my limited knowledge of R. (I have programmed a lot in C, so the logic I tried to implement in R is also C-like.)
The code I used is:
test <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
test[, 11] <- as.numeric(test[, 11])  # convert column 11 from character to numeric
test2 <- table(test$State)            # number of hospitals per state
From the table test2, I can get the value of a particular entry with test2[[2]], but I couldn't work out how to use conditional logic to get the states with fewer than 20 hospitals. (If I get the state names, I can use subset() to address the actual problem.) I also looked at the dimnames() function but couldn't find a way to solve my problem with it. So my question is: in R, how can I check the values of a table against a threshold?
The values stored in test2 are:
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37
MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX
134 133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370
UT VA VI VT WA WI WV WY ##State Name
42 87 2 15 88 125 54 29 ##Count of Hospital
As Arun also noted in his comment, you can use names(test2[test2 >= 20]) to get the states with 20 or more hospitals (use test2 < 20 instead to get the states to exclude). Here is a nice explanation of why you should avoid subset.
Or you can transform your table to a data.frame and use subset:
dat <- as.data.frame(test2)
subset(dat, Freq < 20)
nn Freq
1 AK 17
8 DC 8
9 DE 6
12 GU 1
13 HI 19
42 RI 12
49 VI 2
50 VT 15
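Putting the pieces together, here is a minimal sketch of the whole workflow: count hospitals per state, pick the states at or above the threshold, and filter the original rows. The hospitals data frame is a made-up stand-in for the real file (only three states, with counts copied from the table above), so treat the names as assumptions.

```r
# Hypothetical stand-in for the real data: one row per hospital.
hospitals <- data.frame(
  State = rep(c("AK", "AL", "DC"), times = c(17, 98, 8)),
  stringsAsFactors = FALSE
)

# Count hospitals per state.
counts <- table(hospitals$State)

# Names of the states with at least 20 hospitals.
keep <- names(counts)[counts >= 20]

# Keep only the rows whose state passed the threshold.
filtered <- hospitals[hospitals$State %in% keep, , drop = FALSE]

print(keep)
print(nrow(filtered))
```

Logical indexing plus %in% avoids the explicit loop that a C-style solution would use.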
So these are the survey results. I have tried to run pairwise tests (pairwise.wilcox.test) on these results, which were collected in Spring and Autumn at these sites, but I can't get a specific p-value showing which site has the most influence.
This is the error message I keep getting. My dataset is unbalanced, i.e. some of the sites were not surveyed in Spring, which I think may be the issue.
Error in wilcox.test.default(xi, xj, paired = paired, ...) :
'x' must be numeric
So I'm not sure whether I have laid out the table wrongly for seeing how much site influences the results between Spring and Autumn.
Site Autumn Spring
Stokes Bay 25 6
Stokes Bay 54 6
Stokes Bay 31 0
Gosport Wall 213 16
Gosport Wall 24 19
Gosport Wall 54 60
No Mans Land 76 25
No Mans Land 66 68
No Mans Land 229 103
Osbourne 1 77
Osbourne 1 92
Osbourne 1 92
Osbourne 2 114 33
Osbourne 2 217 114
Osbourne 2 117 64
Osbourne 3 204 131
Osbourne 3 165 85
Osbourne 3 150 81
Osbourne 4 124 15
Osbourne 4 79 64
Osbourne 4 176 65
Ryde Roads 217 165
Ryde Roads 182 63
Ryde Roads 112 53
Ryde Sands 386 44
Ryde Sands 375 25
Ryde Sands 147 45
Spit Bank 223 23
Spit Bank 78 29
Spit Bank 60 15
St Helen's 1 247 11
St Helen's 1 126 36
St Helen's 1 107 20
St Helen's 2 108 115
St Helen's 2 223 25
St Helen's 2 126 30
Sturbridge 58 43
Sturbridge 107 34
Sturbridge 156 0
Osbourne Deep 1 76 59
Osbourne Deep 1 64 52
Osbourne Deep 1 77 30
Osbourne Deep 2 153 60
Osbourne Deep 2 106 88
Osbourne Deep 2 74 35
Sturbridge Shoal 169 45
Sturbridge Shoal 19 84
Sturbridge Shoal 81 44
Mother's Bank 208
Mother's Bank 119
Mother's Bank 153
Ryde Middle 16
Ryde Middle 36
Ryde Middle 36
Stanswood 14 132
Stanswood 47 87
Stanswood 14 88
This is what I've done so far:
MWU <- read.csv(file.choose(), header = T)
#attach file to workspace
attach(MWU)
#Read column names of the data
colnames(MWU) # Site, Autumn, Spring
MWU.1 <- MWU[c(1,2,3)] #It included blank columns in the df
kruskal.test(MWU.1$Autumn ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Autumn by MWU.1$Site
#Kruskal-Wallis chi-squared = 36.706, df = 24, p-value = 0.0468
kruskal.test(MWU.1$Spring ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Spring by MWU.1$Site
#Kruskal-Wallis chi-squared = 35.134, df = 21, p-value = 0.02729
wilcox.test(MWU.1$Autumn, MWU.1$Spring, paired = T)
#Wilcoxon signed rank exact test
#data: MWU.1$Autumn and MWU.1$Spring
#V = 1066, p-value = 8.127e-08
#alternative hypothesis: true location shift is not equal to 0
#Tried this version too to see if it would give a summary of where the influence is.
pairwise.wilcox.test(MWU.1$Spring, MWU.1$Autumn)
#Error in wilcox.test.default(xi, xj, paired = paired, ...) : not enough (non-missing) 'x' observations
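A likely cause of both errors is the second argument: pairwise.wilcox.test(x, g) expects a numeric response x and a grouping factor g (here Site), not a second numeric column. The sketch below is a minimal illustration under that assumption, using the Autumn counts of three sites from the table; the surveys data frame and its layout are assumptions for illustration only.

```r
# Autumn counts for three of the sites listed above (three surveys each).
surveys <- data.frame(
  Site = rep(c("Stokes Bay", "Ryde Sands", "Spit Bank"), each = 3),
  Autumn = c(25, 54, 31, 386, 375, 147, 223, 78, 60)
)

# x = numeric response, g = grouping factor; p-values are Holm-adjusted.
res <- pairwise.wilcox.test(surveys$Autumn, surveys$Site,
                            p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values, one per pair of sites.
print(res$p.value)
```

Note that with only three observations per site, the smallest two-sided exact p-value a single comparison can produce is 0.1, so no pair can reach the usual 0.05 level however different the sites look.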
My code works 95% correctly but I am not sure why my graph has empty white spaces for certain states. For example Washington state has a count of 152 but it appears NULL?
library(maps)      # map()
library(maptools)  # map2SpatialPolygons()
library(sp)        # CRS(), SpatialPolygonsDataFrame(), spplot()
txt <- "AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT
34 103 78 241 789 200 18 18 13 355 210 26 36 48 119 106 57 98 104 32 81 26 92 62 136 65 34 164 10 30 16 70 107 100 109 151 150 97 113 3 90 15 158 479 68 95 7
WA WI WV WY
152 96 48 14 "
dat <- stack(read.table(text = txt, header = TRUE, fill = TRUE))
names(dat)[2] <-'state.abb'
dat$states <- tolower(state.name[match(dat$state.abb, state.abb)])
mapUSA <- map('state', fill = TRUE, plot = FALSE)
nms <- sapply(strsplit(mapUSA$names, ':'), function(x)x[1])
USApolygons <- map2SpatialPolygons(mapUSA, IDs = nms, CRS('+proj=longlat'))
idx <- match(unique(nms), dat$states)
dat2 <- data.frame(value = dat$value[idx], state = unique(nms))
row.names(dat2) <- unique(nms)
USAsp <- SpatialPolygonsDataFrame(USApolygons, data = dat2)
spplot(USAsp['value'], main = "Armed Males with an Attack Threat Level", sub = "Count Per State", col="transparent")
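One possible cause of the blank states (an assumption, not verified against the original plot) is the way the wrapped text is parsed: read.table(header = TRUE) treats only the first line as the header, so the abbreviations on the third line (WA, WI, WV, WY) are read as data rather than as column names and never match a state. A minimal sketch of a parser that tolerates the wrapping, using a shortened, hypothetical version of txt:

```r
# Shortened, hypothetical version of the wrapped input:
# abbreviations and counts alternate line by line.
txt <- "AK AL AR
34 103 78
WA WI
152 96"

# Read every whitespace-separated token as a string.
tokens <- scan(text = txt, what = "", quiet = TRUE)

# Tokens made only of digits are counts; the rest are abbreviations.
is_count <- grepl("^[0-9]+$", tokens)

# Both subsets keep their original order, so they pair up one to one.
dat <- data.frame(
  state.abb = tokens[!is_count],
  value = as.numeric(tokens[is_count]),
  stringsAsFactors = FALSE
)

print(dat)
```

The resulting dat has the same two columns that the original code builds with stack(), so the rest of the mapping code should be able to use it unchanged.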
I am looking for some help to what seems like a very simple question. Any advice is greatly appreciated! I have created a data frame and I am looking to assign names under one column (Region) based on the values in the other column (Unit).
rdf<-as.data.frame(matrix(NA, nrow= 59, ncol=2))
colnames(rdf)<-c("Unit", "Region")
rdf$Unit<-c(1:35, 37:60)
rdf$Region<- ## (See below)
Here I want Units 1:13 to be labeled East,
Units 14:25 and 27 to be labeled Central,
Units 26, 28:38, 40:43, 45:46 to be labeled West,
and Units 44, 39, 47:60 to be labeled BC.
I've been trying case_when and nested ifelse statements, but I keep getting messages saying that "longer object length is not a multiple of shorter object length".
We can use dplyr. case_when is the way to go here.
library(dplyr)
rdf %>%
mutate(Region = case_when(Unit %in% 1:13 ~ "East",
Unit %in% c(14:25, 27) ~ "Central",
Unit %in% c(26, 28:38, 40:43, 45:46) ~ "West",
Unit %in% c(44, 39, 47:60) ~ "BC"))
Unit Region
1 1 East
2 2 East
3 3 East
4 4 East
5 5 East
6 6 East
7 7 East
8 8 East
9 9 East
10 10 East
11 11 East
12 12 East
13 13 East
14 14 Central
15 15 Central
16 16 Central
17 17 Central
18 18 Central
19 19 Central
20 20 Central
21 21 Central
22 22 Central
23 23 Central
24 24 Central
25 25 Central
26 26 West
27 27 Central
28 28 West
29 29 West
30 30 West
31 31 West
32 32 West
33 33 West
34 34 West
35 35 West
36 37 West
37 38 West
38 39 BC
39 40 West
40 41 West
41 42 West
42 43 West
43 44 BC
44 45 West
45 46 West
46 47 BC
47 48 BC
48 49 BC
49 50 BC
50 51 BC
51 52 BC
52 53 BC
53 54 BC
54 55 BC
55 56 BC
56 57 BC
57 58 BC
58 59 BC
59 60 BC
Because this is a small data frame, using a translation table might be the most convenient way:
xlat <- list(East=c(1:13),
Central=c(14:25,27),
West=c(26,28:38,40:43,45:46),
BC=c(44,39,47:60))
rdf$Region <- NA
for (r in names(xlat)) rdf$Region[rdf$Unit %in% xlat[[r]]] <- r
This solution (a) clearly documents the recoding; (b) will indicate if you have overlooked any unit number (by setting Region to NA); and (c) is easy to alter and maintain.
For larger tables, learn about the join operation among relations. It has many implementations in R.
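As a minimal sketch of the join approach, base R's merge() can do the same recoding: build a two-column lookup table and left-join it onto the data. The lookup and joined names are made up for this illustration.

```r
# Same translation table as above.
xlat <- list(East = 1:13,
             Central = c(14:25, 27),
             West = c(26, 28:38, 40:43, 45:46),
             BC = c(44, 39, 47:60))

# Flatten it into a lookup table with one row per unit.
lookup <- data.frame(
  Unit = unlist(xlat, use.names = FALSE),
  Region = rep(names(xlat), times = lengths(xlat)),
  stringsAsFactors = FALSE
)

# The data to recode (unit 36 does not exist in it).
rdf <- data.frame(Unit = c(1:35, 37:60))

# Left join: keep every row of rdf; unmatched units would get NA.
joined <- merge(rdf, lookup, by = "Unit", all.x = TRUE)

print(head(joined))
```

With all.x = TRUE, any unit missing from the lookup still appears in the result with Region set to NA, which preserves property (b) of the loop version.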
I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
  filter(AB > 100) %>%
  filter(avg == max(avg, na.rm = TRUE))
Here the first filter keeps only the rows where AB is greater than 100, and the second filter selects the entire row where avg is at its maximum. If you only want the maximum value itself, you can do something like this:
data %>%
  filter(AB > 100) %>%
  summarise(max = max(avg, na.rm = TRUE))
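The same idea also works in base R on the column names from the question (AB and average). This is a minimal sketch: the batting data frame reuses the six rows shown in the question plus one made-up row ("Nobody") with zero at-bats and a NaN average.

```r
# Six rows from the question plus one hypothetical player with NaN average.
batting <- data.frame(
  nameLast = c("Rollins", "Wilson", "Suzuki", "Samuel", "Cash", "Pierre", "Nobody"),
  AB = c(716, 705, 704, 701, 699, 699, 0),
  average = c(0.2960894, 0.3262411, 0.3721591, 0.2724679,
              0.3047210, 0.2918455, NaN)
)

# Keep only players with more than 100 at-bats.
eligible <- batting[batting$AB > 100, ]

# Highest average, ignoring NA/NaN values.
best_average <- max(eligible$average, na.rm = TRUE)

# Whole row of the best player; which.max() skips NA/NaN as well.
best_row <- eligible[which.max(eligible$average), ]

print(best_average)
print(best_row)
```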
When I run the commands:
H <- length(table(data$Team))
n.h <- rep(5, H)
strata(data, stratanames=data$Team, size=n.h, method="srswor")
I get the error:
Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?
How can I get this stratified sample? The variable 'Team' is of type factor.
The data is as below:
zz <- "Team League.ID Player Salary POS G GS InnOuts PO A
ANA AL molinjo0 335000 C 73 57 1573 441 37
ANA AL percitr0 7833333 P 3 0 149 1 3
ARI NL bautida0 4000000 RF 141 135 3536 265 8
ARI NL estalbo0 550000 C 7 3 92 19 2
ARI NL finlest0 7000000 CF 104 102 2689 214 5
ARI NL koplomi0 330000 P 72 0 260 6 23
ARI NL sparkst0 500000 P 27 18 362 8 21
ARI NL villaos0 325000 P 17 0 54 0 4
ARI NL webbbr01 335000 P 33 35 624 13 41
ATL NL francju0 750000 1B 125 71 1894 627 48
ATL NL hamptmi0 14625000 P 35 29 517 13 37
ATL NL marreel0 3000000 LF 90 42 1125 80 4
ATL NL ortizru0 6200000 P 32 34 614 7 38
BAL AL surhobj0 800000 LF 100 31 805 69 0"
data <- read.table(text=zz, header=T)
This should work:
library(sampling)
H <- length(unique(data$Team))  # number of strata; works for factor or character columns
n.h <- rep(5, H)
strata(data, stratanames=c("Team"), size=n.h, method="srswor")
stratanames should be a character vector of column names, not a reference to the actual column data.
Update:
Now that example data is available, I see another problem: you are sampling without replacement (wor), but the requested sample sizes are bigger than the available data. You need to sample with replacement in this case:
smpl <- strata(data, stratanames=c("Team"), size=n.h, method="srswr")
BTW, you get the actual data with:
sampledData <- getdata(data, smpl)
This doesn't really answer your question, but a long time ago, I wrote a function called stratified that might be of use to you.
I've posted it here as a GitHub Gist.
Notice that when you have asked for samples that are bigger than your data, it just returns all of the relevant rows.
output <- stratified(data, "Team", 5)
# Some groups
# ---ANA, ATL, BAL---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
table(output$Team)
#
# ANA ARI ATL BAL
# 2 5 4 1
output
# Team League.ID Player Salary POS G GS InnOuts PO A
# 1 ANA AL molinjo0 335000 C 73 57 1573 441 37
# 2 ANA AL percitr0 7833333 P 3 0 149 1 3
# 9 ARI NL webbbr01 335000 P 33 35 624 13 41
# 7 ARI NL sparkst0 500000 P 27 18 362 8 21
# 8 ARI NL villaos0 325000 P 17 0 54 0 4
# 3 ARI NL bautida0 4000000 RF 141 135 3536 265 8
# 6 ARI NL koplomi0 330000 P 72 0 260 6 23
# 12 ATL NL marreel0 3000000 LF 90 42 1125 80 4
# 13 ATL NL ortizru0 6200000 P 32 34 614 7 38
# 10 ATL NL francju0 750000 1B 125 71 1894 627 48
# 11 ATL NL hamptmi0 14625000 P 35 29 517 13 37
# 14 BAL AL surhobj0 800000 LF 100 31 805 69 0
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
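The capping behaviour described above (returning every row when a group has fewer rows than requested) can be sketched in a few lines of base R. This is not the stratified function from the Gist, only an illustration of the idea; sample_per_group and the teams data frame are hypothetical names, with group sizes matching the example data.

```r
# Hypothetical helper: draw up to `size` rows from each group, without
# replacement, returning the whole group when it is smaller than `size`.
sample_per_group <- function(df, group, size) {
  pieces <- split(df, df[[group]])
  sampled <- lapply(pieces, function(g) {
    n <- min(size, nrow(g))               # cap at the group size
    g[sample.int(nrow(g), n), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

# Group sizes mirror the example data: ANA 2, ARI 7, ATL 4, BAL 1.
teams <- data.frame(
  Team = rep(c("ANA", "ARI", "ATL", "BAL"), times = c(2, 7, 4, 1)),
  Player = c("molinjo0", "percitr0",
             "bautida0", "estalbo0", "finlest0", "koplomi0",
             "sparkst0", "villaos0", "webbbr01",
             "francju0", "hamptmi0", "marreel0", "ortizru0",
             "surhobj0"),
  stringsAsFactors = FALSE
)

out <- sample_per_group(teams, "Team", 5)
print(table(out$Team))
```

Only ARI has more than five rows, so it is the only group that is actually subsampled; the other three are returned in full.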