Subset data frame in R given grouping length criterion

I'm working on some exercises based on this dataset.
There's a State column, and column 11 lists the rate of deaths per month from heart attack for each hospital in the state:
> table(data$State)
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96
Now I try to keep only those states where at least 20 values are available:
> table(data$State)>20
AK AL AR AZ CA CO CT DC DE FL GA GU
FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
So using subset I try to get a subset of data based on the above conditions, but that gives me a result I can't follow:
> data_subset <- subset(data, table(data$State)>20)
> table(data_subset$State)
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY
14 84 66 65 288 64 25 8 5 155 109 1 19 93 24 153 107 100 83
Why am I getting AK 14, when I would expect that state to be filtered out by the condition?

You can use the following approach to filter out the data with less than 20 rows:
tab <- table(data$State)
data[data$State %in% names(tab)[tab > 19], ]
Your code
subset(data, table(data$State)>20)
does not work because table(data$State)>20 returns a logical vector whose length is the number of distinct states, i.e. length(table(data$State)), which is shorter than the number of rows in your data frame. Due to vector recycling, the vector is repeated until it matches the longer length. E.g., have a look at (1:3)[c(TRUE, FALSE)].
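The recycling can be made concrete with a small reproducible sketch (toy data, not the hospital dataset):

```r
# Toy data: three "AL" rows and one "AK" row
df <- data.frame(State = c("AL", "AL", "AL", "AK"))

# Per-state counts: AK = 1, AL = 3 (alphabetical order)
tab <- table(df$State)

# Wrong: tab > 2 has length 2, so it is recycled over the 4 rows
wrong <- subset(df, tab > 2)

# Right: turn the per-state counts into a per-row condition
right <- subset(df, State %in% names(tab)[tab > 2])
```

Here `wrong` keeps rows 2 and 4 (including an "AK" row), while `right` keeps exactly the three "AL" rows.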


R - Filter a specific variable based on other variables

I have the problem that I want to filter my variable Position (containing 5 levels: Analyst, CEO, Analyst level II, Manager level II, Ceo level II) by age.
This means that I want to remove "Analyst level II", "Ceo level II" and "Manager level II" if their age is below 58, and keep them if their age is 58 or above. The other levels (Analyst, CEO) shouldn't be affected by the age constraint. (Example: Analyst with age = 50 should be kept.)
library(tidyverse)
Test<- tibble(Age=50:69,Position=rep(c("Analyst","Analyst Level II","Ceo level II", "Manager", "Manager level II"), times=4),Value=201:220)
exam32 <- Test %>%
  filter(!Position == c("Analyst level II", "Ceo level II", "Manager level II"), Age > 58)
View(exam32)
Hope you can help
Use %in% to match the strings, and & to specify that both conditions should be satisfied.
Test %>%
  filter(!(Position %in% c("Analyst level II",
                           "Ceo level II",
                           "Manager level II") & Age < 58))
# # A tibble: 17 x 3
# Age Position Value
# <int> <chr> <int>
# 1 50 Analyst 201
# 2 51 Analyst Level II 202
# 3 53 Manager 204
# 4 55 Analyst 206
# 5 56 Analyst Level II 207
# 6 58 Manager 209
# 7 59 Manager level II 210
# 8 60 Analyst 211
# 9 61 Analyst Level II 212
# 10 62 Ceo level II 213
# 11 63 Manager 214
# 12 64 Manager level II 215
# 13 65 Analyst 216
# 14 66 Analyst Level II 217
# 15 67 Ceo level II 218
# 16 68 Manager 219
# 17 69 Manager level II 220
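As an aside on why the original filter(!Position == c(...)) attempt misbehaves: == compares element-wise with recycling, while %in% tests set membership. A minimal sketch:

```r
pos <- c("Analyst", "Ceo level II", "Manager level II")

# Element-wise comparison with recycling: each position is compared
# against the value at the same index, not against the whole set
pos == c("Ceo level II", "Manager level II", "Analyst")
# [1] FALSE FALSE FALSE

# Set membership: each position is tested against all values
pos %in% c("Ceo level II", "Manager level II", "Analyst")
# [1] TRUE TRUE TRUE
```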

Store values in a cell dataframe

I am trying to store data in multiple cells of a data frame, but my code only stores the data in the last cell (of the dd array). Please see my output below.
Can somebody please correct me? I cannot figure out what I am doing wrong.
Thanks in advance,
MyData <- read.csv(file="Pat_AR_035.csv", header=TRUE, sep=",")
dd <- unique(MyData$POLICY_NUM)
for (j in length(dd)) {
  myDF <- data.frame(i=1:length(dd), m=I(vector('list', length(dd))))
  myDF$m[[j]] <- data.frame(j, MyData[which(MyData$POLICY_NUM==dd[j] & MyData$ACRES), ], ncol(MyData), nrow(MyData))
}
[[60]]
NULL
[[61]]
NULL
[[62]]
NULL
[[63]]
j OBJECTID DIVISION POLICY_SYM POLICY_NUM YIELD_ID LINE_ID RH_CLU_ID ACRES PLANT_DATE ACRE_TYPE CLU_DETERM STATE COUNTY FARM_SERIA TRACT
1646 63 1646 8 MP 754033 3 20 39565604 8.56 5/3/2014 PL A 3 35 109 852
1647 63 1647 8 MP 754033 1 10 39565605 30.07 4/19/2014 PL A 3 35 109 852
1648 63 1648 8 MP 754033 1 10 39565606 56.59 4/19/2014 PL A 3 35 109 852
CLU_NUMBER FIELD_ACRE RMA_CLU_ID UPDATE_DAT Percent_Ar RHCLUID Field1 OBJECTID_1 DIVISION_1 STATE_1 COUNTY_1
1646 3 8.56 F68E591A-ECC2-470B-A012-201C3BB20D7F 9/21/2014 63.4990 39565604 1646 1646 8 3 35
1647 1 30.07 eb04cfc0-e78b-415f-b447-9595c81ef09e 9/21/2014 100.0000 39565605 1647 1647 8 3 35
1648 2 56.59 5922d604-e31c-4b9d-b846-9f38e2d18abe 9/21/2014 92.1442 39565606 1648 1648 8 3 35
POLICY_N_1 YIELD_ID_1 RH_CLU_ID_ short_dist coords_x1 coords_x2 optional SHAPE_Leng SHAPE_Area ncol.MyData. nrow.MyData.
1646 754033 3 39565604 5.110837 516747.8 -221751.4 TRUE 831.3702 34634.73 35 1757
1647 754033 1 39565605 5.606284 515932.1 -221702.0 TRUE 1469.4800 121611.46 35 1757
1648 754033 1 39565606 5.325399 516380.1 -221640.9 TRUE 1982.8757 228832.22 35 1757
for (j in length(dd))
This doesn’t iterate over dd — it iterates over a single number: the length of dd. Not much of an iteration. You probably meant to write the following or something similar:
for (j in seq_along(dd))
However, there are more issues with your code. For instance, the myDF variable is continuously overwritten inside your loop, which probably isn’t what you intended at all. Instead, you should probably create objects in an lapply statement and forego the loop.
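A sketch of that lapply approach, using a toy stand-in for MyData since Pat_AR_035.csv isn't available here (the column values are made up for illustration):

```r
# Toy stand-in for the CSV; the real data has many more columns
MyData <- data.frame(POLICY_NUM = c(754033, 754033, 800001),
                     ACRES      = c(8.56, 30.07, 56.59))

dd <- unique(MyData$POLICY_NUM)

# Build the data frame once, outside any loop, then fill the list
# column with one subset per policy number
myDF <- data.frame(i = seq_along(dd))
myDF$m <- lapply(dd, function(p) MyData[MyData$POLICY_NUM == p, ])
```

Because the data frame is created once and lapply returns one element per policy number, no cell is left NULL.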

Combining factor levels in R 3.2.1

In previous versions of R I could combine factor levels that didn't have a "significant" threshold of volume using the following little function:
whittle = function(data, cutoff_val){
  # convert to a data frame
  tab = as.data.frame.table(table(data))
  # returns vector of indices where value is below cutoff_val
  idx = which(tab$Freq < cutoff_val)
  levels(data)[idx] = "Other"
  return(data)
}
This takes in a factor vector, looks for levels that don't appear "often enough" and combines all of those levels into one "Other" factor level. An example of this is as follows:
> sort(table(data$State))
05 27 35 40 54 84 9 AP AU BE BI DI G GP GU GZ HN HR JA JM KE KU L LD LI MH NA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
OU P PL RM SR TB TP TW U VD VI VS WS X ZH 47 BL BS DL M MB NB RP TU 11 DU KA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
BW ND NS WY AK SD 13 QC 01 BC MT AB HE ID J NO LN NM ON NE VT UT IA MS AO AR ME
4 4 4 4 5 5 6 6 7 7 7 8 8 8 9 10 11 17 23 26 26 30 31 31 38 40 44
OR KS HI NV WI OK KY IN WV AL CO WA MN NH MO SC LA TN AZ IL NC MI GA OH ** CT DE
45 47 48 57 57 64 106 108 112 113 120 125 131 131 135 138 198 200 233 492 511 579 645 646 840 873 1432
RI DC TX MA FL VA MD CA NJ PA NY
1782 2513 6992 7027 10527 11016 11836 12221 15485 16359 34045
Now when I use whittle it returns me the following message:
> delete = whittle(data$State, 1000)
Warning message:
In `levels<-`(`*tmp*`, value = c("Other", "Other", "Other", "Other", :
duplicated levels in factors are deprecated
How can I modify my function so that it has the same effect but doesn't rely on this deprecated behavior? Should I convert to character, table it, and then replace the low-frequency values with "Other"?
I've always found it easiest (less typing and less headache) to convert to character and back for these sorts of operations. Keeping with your as.data.frame.table and using replace to do the replacement of the low-frequency levels:
whittle <- function(data, cutoff_val) {
  tab = as.data.frame.table(table(data))
  factor(replace(as.character(data), data %in% tab$data[tab$Freq < cutoff_val], "Other"))
}
Testing on some sample data:
state <- factor(c("MD", "MD", "MD", "VA", "TX"))
whittle(state, 2)
# [1] MD MD MD Other Other
# Levels: MD Other
I think this version should work. The levels<- function allows you to collapse levels by assigning a list (see ?levels).
whittle <- function(data, cutoff_val){
  tab <- table(data)
  shouldmerge <- tab < cutoff_val
  tokeep <- names(tab)[!shouldmerge]
  tomerge <- names(tab)[shouldmerge]
  nv <- c(as.list(setNames(tokeep, tokeep)), list("Other" = tomerge))
  levels(data) <- nv
  return(data)
}
And we test it with
set.seed(15)
x <- factor(c(sample(letters[1:10], 100, replace=TRUE), sample(letters[11:13], 10, replace=TRUE)))
table(x)
# x
# a b c d e f g h i j k l m
# 5 11 8 8 7 5 13 14 14 15 2 3 5
y <- whittle(x, 9)
table(y)
# y
# b g h i j Other
# 11 13 14 14 15 43
It's worth adding to this answer that the new forcats package contains the fct_lump() function which is dedicated to this.
Using #MrFlick's data:
x <- factor(c(sample(letters[1:10], 100, replace=T),
sample(letters[11:13], 10, replace=T)))
library(forcats)
library(magrittr) ## for %>% ; could also load dplyr
fct_lump(x, n=5) %>% table
# b g h i j Other
#11 13 14 14 15 43
The n argument specifies the number of most common values to preserve.
Here's another way of doing it by replacing all the items below the threshold with the first and then renaming that level to Other.
whittle <- function(x, thresh) {
  belowThresh <- names(which(table(x) < thresh))
  x[x %in% belowThresh] <- belowThresh[1]
  levels(x)[levels(x) == belowThresh[1]] <- "Other"
  factor(x)
}

How do I exclude rows in R based on multiple values?

Let's say I have a dataset that looks like this:
> data
iso3 Vaccine Coverage
1 ARG DPT3 95
2 ARG MCV 94
3 ARG Pol3 91
4 KAZ DPT3 99
5 KAZ MCV 98
6 KAZ Pol3 99
7 COD DPT3 67
8 COD MCV 62
9 COD Pol3 66
I want to filter out some records based on several conditions being met simultaneously; say, I want to drop any data from Argentina (ARG) with a coverage of more than 93 percent. The result should thus exclude rows 1 and 2:
iso3 Vaccine Coverage
3 ARG Pol3 91
4 KAZ DPT3 99
5 KAZ MCV 98
6 KAZ Pol3 99
7 COD DPT3 67
8 COD MCV 62
9 COD Pol3 66
I tried using subset() but it excludes too much:
> subset(data, iso3!="ARG" & Coverage>93)
iso3 Vaccine Coverage
4 KAZ DPT3 99
5 KAZ MCV 98
6 KAZ Pol3 99
The problem seems to be that the & operator isn't behaving like a boolean AND and returning the intersection of the two conditions; instead, it looks like it's acting as a boolean OR, returning their union.
My question is, what do I use here to force the boolean AND?
!= is an operator meaning "not equal".
! indicates logical negation (NOT)
Your condition
iso3!="ARG" & Coverage>93
is
(iso3 not equal to "ARG") AND (Coverage > 93)
If you want
NOT((iso3 equal to "ARG") AND (Coverage > 93))
you need to construct the condition accordingly, e.g.
!(iso3 == 'ARG' & Coverage > 93)
For a full description of the logical operators in base R, see
help('Logic', package='base')
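Putting it together with the question's data (reconstructed here), subset() with the negated compound condition drops exactly rows 1 and 2:

```r
data <- data.frame(
  iso3     = rep(c("ARG", "KAZ", "COD"), each = 3),
  Vaccine  = rep(c("DPT3", "MCV", "Pol3"), times = 3),
  Coverage = c(95, 94, 91, 99, 98, 99, 67, 62, 66)
)

# Drop only the rows that are ARG AND above 93 percent coverage
subset(data, !(iso3 == "ARG" & Coverage > 93))
```

Seven of the nine rows survive; the ARG Pol3 row (coverage 91) is correctly kept.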

Count rows for selected column values and remove rows based on count in R

I am new to R and am trying to work on a data frame from a csv file (as seen from the code below). It has hospital data with 46 columns and 4706 rows (one of those columns being 'State'). I made a table showing counts of rows for each value in the State column. So in essence the table shows each state and the number of hospitals in that state. Now what I want to do is subset the data frame and create a new one without the entries for which the state has less than 20 hospitals.
How do I count the occurrences of values in the State column and then remove those rows whose count adds up to less than 20? Maybe I am supposed to use the table() function, remove the undesired data, and put the result into a new data frame using something like lapply(), but I'm not sure due to my lack of experience with R.
Any help will be much appreciated. I have seen other examples of removing rows that have certain column values in this site, but not one that does that based on the count of a particular column value.
> outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
> hospital_nos <- table(outcome$State)
> hospital_nos
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37 134
MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA
133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370 42 87
VI VT WA WI WV WY
2 15 88 125 54 29
Here is one way to do it. Starting with the following data frame :
df <- data.frame(x=c(1:10), y=c("a","a","a","b","b","b","c","d","d","e"))
If you want to keep only the rows with more than 2 occurrences in df$y, you can do :
tab <- table(df$y)
df[df$y %in% names(tab)[tab>2],]
Which gives :
x y
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
And here is a one line solution with the plyr package :
ddply(df, "y", function(d) {if(nrow(d)>2) d else NULL})
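The same count-based filter can also be written with dplyr (not used in the answer above, so treat this as an alternative sketch): group by y, then keep only groups with more than two rows.

```r
library(dplyr)

df <- data.frame(x = 1:10, y = c("a","a","a","b","b","b","c","d","d","e"))

# n() gives the size of the current group inside a grouped filter
df %>%
  group_by(y) %>%
  filter(n() > 2) %>%
  ungroup()
```

This keeps the six rows belonging to groups "a" and "b", matching the base-R and plyr results.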
