Subset around values that follow a pattern - r

Hopefully, this is a fairly straightforward question. I am using R to help subset some data that I am working with. Below is a print() of some of the data I am currently working with. I am trying to create a subset() of the data based on JobCode. As you can see, JobCode follows a pattern (00-0000) where the first two digits are the same for a specific industry.
ID State StateName JobCode
1 AL Alabama 51-9199
2 AL Alabama 27-3011
4 AL Alabama 49-9043
5 AL Alabama 49-2097
My current attempt is test <- subset(data, data$State == "AL" & data$JobCode == "15-####") (where # is a placeholder for the remaining four digits) to subset for JobCode values beginning with "15-". Is there any way to tell subset() to accept any value for those remaining four digits?
I'm sorry for the poor formatting, as I am new to Stack Overflow and also quite inexperienced with R. Thank you for your help.

There are no wildcard characters in string equality; you need to use a function. You could use substr() to extract the first three characters:
test <- subset(data, State == "AL" & substr(JobCode, 1, 3) == "15-")
Also note that you don't need data$ inside the call to subset(): variables are evaluated in the context of the data frame passed to that function.
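Base R also has startsWith() (available since R 3.3.0), which states the intent more directly; a minimal sketch using the same data and test names as above:
test <- subset(data, State == "AL" & startsWith(JobCode, "15-"))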

You can use the %like% operator from the data.table package. Since %like% does regex matching, anchor the pattern with ^ so it only matches at the start of the code:
library(data.table)
setDT(df)
df[State == "AL" & JobCode %like% "^15-"]
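Equivalently, if you'd rather stay in base R, grepl() with the same anchored pattern gives the same filter on the original data frame:
subset(data, State == "AL" & grepl("^15-", JobCode))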

Related

Is there a way to map or match people's names to religions in R?

I'm working on a paper on electoral politics and tried using this dataset to calculate the share of the electorate that each religion makes up. I created an if() statement and a Christian counter variable and tried to increase the count by one whenever a Christian name pops up, but was unable to do so. Would appreciate it if you could help me with this.
library(dplyr)
library(ggplot2)
Christian=0
if(Sample...Sheet1$V2=="James"){
Christian=Christian+1
}
PS: The output is
Warning message:
In if (Sample...Sheet1$V2 == "James") { :
the condition has length > 1 and only the first element will be used
Notwithstanding my comment about the fundamental non-validity of this approach, here’s how you would solve this general problem in R:
Generate a lookup table of the different names and categories — this table is independent of your input data:
library(tibble)  # tribble() comes from the tibble package
religion_lookup = tribble(
~ Name, ~ Religion,
'James', 'Christian',
'Christopher', 'Christian',
'Ahmet', 'Muslim',
'Mohammed', 'Muslim',
'Miriam', 'Jewish',
'Tarjinder', 'Sikh'
)
Match your input data against the lookup table (I'm using an input table data with a column Name instead of your Sample...Sheet1$V2):
matched = match(data$Name, religion_lookup$Name)
religion = religion_lookup$Religion[matched]
Count the results:
table(religion)
religion
Christian    Jewish    Muslim      Sikh
        2         5         3         1
Note the lack of ifs and loops in the above.
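If you prefer a dplyr flavor, a hedged equivalent of the match-and-count steps (assuming the same input table data with a Name column) is a join followed by count():
library(dplyr)
data %>%
  left_join(religion_lookup, by = "Name") %>%
  count(Religion)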
Christian <- sum(Sample...Sheet1$V2 == "James")
There you go; no if block needed.
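To count several names at once, the same vectorized idea extends with %in%; the name vector here is purely illustrative:
christian_names <- c("James", "Christopher")  # hypothetical name list
Christian <- sum(Sample...Sheet1$V2 %in% christian_names)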

Extracting a value based on multiple conditions in R

Quick question: I have a data frame (severity) that looks like this:
industryType relfreq relsev
1 Consumer Products 2.032520 0.419048
2 Biotech/Pharma 0.650407 3.771429
3 Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices 1.463415 3.028571
6 Software 0.758808 1.314286
7 Business/Consumer Services 0.623306 0.723810
8 Telecommunications 0.650407 4.247619
If I wanted to pull the relfreq of Medical Devices (row 5), how could I subset just that value?
I was thinking about just indexing with severity$relfreq[[5]], but I'd be using this line in a bigger function where the user specifies the industry, i.e.
example <- function(industrytype) {
weight <- relfreq of industrytype parameter
thing2 <- thing1*weight
return(thing2)
}
So if I subset by an index, is there a way for R to know which index corresponds to the industry type specified in the function parameter? Or is it easier to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the two columns you requested (industryType and relfreq).
The tidyverse collection of packages lets you do this intuitively:
library(tidyverse)
data_want <- severity %>%
subset(industryType =="Medical Devices") %>%
select(industryType, relfreq)
Here you read from left to right, with %>% passing the result of each step on to the next, as if nesting.
I think it is better to select the whole row first, then choose the column you would like to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
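Tying this back to the function in the question, a minimal sketch (assuming severity is the data frame shown above and thing1 is defined elsewhere):
example <- function(industrytype) {
  # match() gives the row index of the first matching industry name
  weight <- severity$relfreq[match(industrytype, severity$industryType)]
  thing2 <- thing1 * weight  # assumes thing1 exists in the calling environment
  return(thing2)
}
example("Medical Devices")  # would use relfreq = 1.463415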

subset data frame based on character value

I'm trying to subset a data frame that I imported with read.table using the colClasses='character' option.
A small sample of the data can be found here
Full99<-read.csv("File.csv",header=TRUE,colClasses='character')
After removing duplicates, missing values, and all unnecessary columns, I get a data frame with these dimensions:
> dim(NoMissNoDup99)
[1] 81551 6
I'm interested in reducing the data to only include observations with a specific Service.Type.
I've tried the subset() function:
MU99<-subset(NoMissNoDup99,Service.Type=='Apartment'|
Service.Type=='Duplex'|
Service.Type=='Triplex'|
Service.Type=='Fourplex',
select=Service.Type:X.13)
dim(MU99)
[1] 0 6
MU99<-NoMissNoDup99[which(NoMissNoDup99$Service.Type!='Hospital'
& NoMissNoDup99$Service.Type!= 'Hotel or Motel'
& NoMissNoDup99$Service.Type!= 'Industry'
& NoMissNoDup99$Service.Type!= 'Micellaneous'
& NoMissNoDup99$Service.Type!= 'Parks & Municipals'
& NoMissNoDup99$Service.Type!= 'Restaurant'
& NoMissNoDup99$Service.Type!= 'School or Church or Charity'
& NoMissNoDup99$Service.Type!='Single Residence'),]
but that doesn't remove observations.
I've tried that same method but slightly tweaked...
MU99<-NoMissNoDup99[which(NoMissNoDup99$Service.Type=='Apartment'
|NoMissNoDup99$Service.Type=='Duplex'
|NoMissNoDup99$Service.Type=='Triplex'
|NoMissNoDup99$Service.Type=='Fourplex'), ]
but that removes every observation...
The final subset should have somewhere around 8000 observations
I'm pretty new to R and Stack Overflow, so I apologize if there's some convention of posting I've neglected to follow, but if anyone has a magic bullet to get this data to cooperate, I'd love your insights :)
The different methods would work if you were matching the right values; your issue is likely extra spaces in the values of Service.Type.
You can avoid this kind of issue by using grep, for example:
NoMissNoDup99[grep("Apartment|Duplex|Business",NoMissNoDup99$Service.Type),]
## exclude
MU99<-subset(NoMissNoDup99,!(Service.Type %in% c('Hospital','Hotel or Motel')))
##include
MU99<-subset(NoMissNoDup99,Service.Type %in% c('Apartment','Duplex'))
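If stray whitespace really is the culprit, trimming the values first also lets the exact-match approaches work; a sketch using base R's trimws() (available since R 3.2.0):
NoMissNoDup99$Service.Type <- trimws(NoMissNoDup99$Service.Type)
MU99 <- subset(NoMissNoDup99,
               Service.Type %in% c('Apartment','Duplex','Triplex','Fourplex'))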

How to remove specific duplicates in R

I have the following data:
> head(bigdata)
type text
1 neutral The week in 32 photos
2 neutral Look at me! 22 selfies of the week
3 neutral Inside rebel tunnels in Homs
4 neutral Voices from Ukraine
5 neutral Water dries up ahead of World Cup
6 positive Who's your hero? Nominate them
My duplicates will look like this (with empty $type):
7 Who's your hero? Nominate them
8 Water dries up ahead of World Cup
I remove duplicates like this:
bigdata <- bigdata[!duplicated(bigdata$text),]
The problem is, it removes the wrong duplicate. I want to remove the one where $type is empty, not the one that has a value for $type.
How can I remove a specific duplicate in R?
So here's a solution that does not use duplicated(...).
# creates an example - you have this already...
set.seed(1) # for reproducible example
bigdata <- data.frame(type=rep(c("positive","negative"),5),
text=sample(letters[1:10],10),
stringsAsFactors=F)
# add some duplicates
bigdata <- rbind(bigdata,data.frame(type="",text=bigdata$text[1:5]))
# you start here...
newdf <- with(bigdata,bigdata[order(text,type,decreasing=T),])
result <- aggregate(newdf,by=list(text=newdf$text),head,1)[2:3]
This sorts bigdata by text and type in decreasing order, so that for a given text the empty type appears after any non-empty type. Then we keep only the first row for each text.
If your data really is "big", then a data.table solution will probably be faster.
library(data.table)
DT <- as.data.table(bigdata)
setkey(DT, text, type)
DT.result <- DT[, list(type = type[.N]), by = text]
This does basically the same thing, but since setkey sorts only in increasing order, we use type[.N] to get the last occurrence of type for every text. .N is a special variable holding the number of elements in that group.
Note that the current development version implements a function setorder(), which orders a data.table by reference, and can order in both increasing and decreasing order. So, using the devel version, it'd be:
require(data.table) # 1.9.3
setorder(DT, text, -type)
DT[, list(type = type[1L]), by = text]
You should keep rows that are either not duplicated or not missing a type value. The duplicated function only flags the second and later occurrences of each value (check out duplicated(c(1, 1, 2))), so we need to combine it with a second call using fromLast=TRUE. Since the blank types in the question may be empty strings rather than NA, the check below covers both:
bigdata <- bigdata[!(duplicated(bigdata$text) |
                     duplicated(bigdata$text, fromLast=TRUE)) |
                   !(is.na(bigdata$type) | bigdata$type == ""), ]
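A quick way to see what the two duplicated() calls flag, on a toy vector:
x <- c("a", "b", "a")
duplicated(x)                                 # FALSE FALSE TRUE (later copies only)
duplicated(x) | duplicated(x, fromLast=TRUE)  # TRUE  FALSE TRUE (all copies)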
Alternatively, flag every text that occurs more than once, then drop the copies whose type is empty:
dup_any <- duplicated(bigdata$text) | duplicated(bigdata$text, fromLast=TRUE)
bigdata <- bigdata[!(dup_any & bigdata$type == ""), ]

Simple lookup to insert values in an R data frame

This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:
Case zip market
1 44485 NA
2 44488 NA
3 43210 NA
There are over 3.5 million records.
Then, I have a second data frame, 'zipcodes'.
market zip
1 44485
1 44486
1 44488
... ... (100 zips in market 1)
2 43210
2 43211
... ... (100 zips in market 2, etc.)
I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.
Since you don't care about the market column in alldata, you can first strip it off and then merge the remaining columns with zipcodes on the zip column using merge():
merge(alldata[, c("Case", "zip")], zipcodes, by="zip")
The by parameter specifies the key criteria, so if you have a compound key, you could do something like by=c("zip", "otherfield").
Another option that worked for me and is very simple:
alldata$market<-with(zipcodes, market[match(alldata$zip, zip)])
With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:
library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])
Or
alldata$zip %l% zipcodes[, 2:1]
Here's the dplyr way of doing it:
library(tidyverse)
alldata %>%
select(-market) %>%
left_join(zipcodes, by="zip")
which, on my machine, performs roughly the same as lookup.
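If you want to verify that on your own data, a rough sketch with the microbenchmark package (assuming alldata, zipcodes, and a loaded dplyr as above):
library(microbenchmark)
microbenchmark(
  match = with(zipcodes, market[match(alldata$zip, zip)]),
  join  = alldata %>% select(-market) %>% left_join(zipcodes, by = "zip"),
  times = 10
)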
The syntax of match is a bit clumsy. You might find the lookup package easier to use.
alldata <- data.frame(Case=1:3, zip=c(44485,44488,43210), market=c(NA,NA,NA))
zipcodes <- data.frame(market=c(1,1,1,2,2), zip=c(44485,44486,44488,43210,43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
## Case zip market
## 1 1 44485 1
## 2 2 44488 1
## 3 3 43210 2
