Populating a column based on row matches without a for loop in R

Is there a way to obtain the annual count values based on the state, species, and year, without using a for loop?
Name | State | Age | Species    | Annual Ct
Nemo | NY    | 5   | Clownfish  | ?
Dora | CA    | 2   | Regal Tang | ?
Lookup table:
State | Species    | Year | AnnualCt
NY    | Clownfish  | 2012 | 500
NY    | Clownfish  | 2014 | 200
CA    | Regal Tang | 2001 | 400
CA    | Regal Tang | 2014 | 680
CA    | Regal Tang | 2000 | 700
The output would be:
Name | State | Age | Species    | Annual Ct
Nemo | NY    | 5   | Clownfish  | 200
Dora | CA    | 2   | Regal Tang | 680
What I've tried:
pets <- data.frame("Name" = c("Nemo", "Dora"), "State" = c("NY", "CA"),
                   "Age" = c(5, 2), "Species" = c("Clownfish", "Regal Tang"))
fishes <- data.frame("State" = c("NY", "NY", "CA", "CA", "CA"),
                     "Species" = c("Clownfish", "Clownfish", "Regal Tang",
                                   "Regal Tang", "Regal Tang"),
                     "Year" = c("2012", "2014", "2001", "2014", "2000"),
                     "AnnualCt" = c("500", "200", "400", "680", "700"))
pets["AnnualCt"] <- NA
for (row in 1:nrow(pets)) {
  pets$AnnualCt[row] <- as.character(droplevels(
    fishes[which(fishes$State == pets[row, ]$State &
                 fishes$Species == pets[row, ]$Species &
                 fishes$Year == 2014),
           which(colnames(fishes) == "AnnualCt")]
  ))
}

I'm confused as to what you're trying to do; isn't this just a join?
library(dplyr)
left_join(pets, fishes) %>%
  filter(Year == 2014) %>%
  select(-Year)
# Joining, by = c("State", "Species")
#   Name State Age    Species AnnualCt
# 1 Nemo    NY   5  Clownfish      200
# 2 Dora    CA   2 Regal Tang      680
Explanation: left_join() both data.frames by State and Species, filter for Year == 2014, and output without the Year column.
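For completeness, the same join-then-filter idea works in base R with merge(); a minimal sketch, assuming the pets and fishes frames defined in the question:
out <- merge(pets, fishes, by = c("State", "Species"))        # join on the shared columns
out <- out[out$Year == "2014", setdiff(names(out), "Year")]   # keep 2014 rows, drop Year
out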

Related

R Studio: Match first n characters between two columns, and fill in value from another column

I have a dataframe "city_table" that looks like this:
+---+---------------------+
| | city |
+---+---------------------+
| 1 | Chicago-2234dxsw |
+---+---------------------+
| 2 | Chicago,IL |
+---+---------------------+
| 3 | Chicago |
+---+---------------------+
| 4 | Chicago - 124421xsd |
+---+---------------------+
| 5 | Chicago_2133xx |
+---+---------------------+
| 6 | Atlanta- 1234xx |
+---+---------------------+
| 7 | Atlanta, GA |
+---+---------------------+
| 8 | Atlanta - 123456T |
+---+---------------------+
I have another city code lookup table "city_lookup" that looks like this:
+---+--------------+-----------+
| | city_name | city_code |
+---+--------------+-----------+
| 1 | Chicago, IL | 001 |
+---+--------------+-----------+
| 2 | Atlanta, GA | 002 |
+---+--------------+-----------+
As you can see, the city names in "city" are messy and inconsistently formatted, whereas the city names in "city_lookup" follow a unified format (City, ST).
I would like a final table that, by matching the first n characters (let's say n = 7) of city_table$city against city_lookup$city_name, returns the proper city code, something like this:
+---+---------------------+-----------+
| | city_name | city_code |
+---+---------------------+-----------+
| 1 | Chicago-2234dxsw | 001 |
+---+---------------------+-----------+
| 2 | Chicago,IL | 001 |
+---+---------------------+-----------+
| 3 | Chicago | 001 |
+---+---------------------+-----------+
| 4 | Chicago - 124421xsd | 001 |
+---+---------------------+-----------+
| 5 | Chicago_2133xx | 001 |
+---+---------------------+-----------+
| 6 | Atlanta- 1234xx | 002 |
+---+---------------------+-----------+
| 7 | Atlanta, GA | 002 |
+---+---------------------+-----------+
| 8 | Atlanta - 123456T | 002 |
+---+---------------------+-----------+
I am doing this in R, preferably using tidyverse/dplyr. Thanks so much for your help!
Even better, as long as the characters after the full city names are always non-letters, you can match on the entire city name, like so:
library(dplyr)
library(tibble)

city_table <- tibble(city = c("Chicago-2234dxsw", "Chicago,IL", "Atlanta - 123456T"))
city_lookup <- tibble(city_name = c("Chicago, IL", "Atlanta, GA"),
                      city_code = c("001", "002"))

city_table %>%
  # keep only the leading run of letters, i.e. the bare city name
  mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city)) %>%
  left_join(city_lookup %>%
              mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city_name)),
            by = "city_clean") %>%
  select(-city_clean, -city_name)
  city              city_code
  <chr>             <chr>
1 Chicago-2234dxsw  001
2 Chicago,IL        001
3 Atlanta - 123456T 002
We can create columns with substring() (as the OP asked in the question) and then do a regex_left_join:
library(dplyr)
library(fuzzyjoin)

city_table %>%
  mutate(city_sub = substring(city, 1, 7)) %>%
  regex_left_join(city_lookup %>%
                    mutate(city_sub = substring(city_name, 1, 7)),
                  by = 'city_sub') %>%
  select(city_name = city, city_code)
-output
#            city_name city_code
#1    Chicago-2234dxsw       001
#2          Chicago,IL       001
#3             Chicago       001
#4 Chicago - 124421xsd       001
#5      Chicago_2133xx       001
#6     Atlanta- 1234xx       002
#7         Atlanta, GA       002
#8   Atlanta - 123456T       002
data
city_table <- structure(list(city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago",
    "Chicago - 124421xsd", "Chicago_2133xx", "Atlanta- 1234xx", "Atlanta, GA",
    "Atlanta - 123456T")), class = "data.frame", row.names = c(NA, -8L))

city_lookup <- structure(list(city_name = c("Chicago, IL", "Atlanta, GA"),
    city_code = c("001", "002")), class = "data.frame", row.names = c(NA, -2L))

R: how to filter out rows that end with a specific list of characters?

I have a data frame that looks like this:
+-----------------+--------+
| Geography | Values |
+-----------------+--------+
| Atlanta, GA | 78 |
+-----------------+--------+
| New York, NY | 30 |
+-----------------+--------+
| Denver, CO | 20 |
+-----------------+--------+
| Omaha, NE | 178 |
+-----------------+--------+
| Los Angeles, CA | 58 |
+-----------------+--------+
| Providence, RI | 100 |
+-----------------+--------+
| Little Rock, AR | 20 |
+-----------------+--------+
| Miami, FL | 50 |
+-----------------+--------+
| ... | |
+-----------------+--------+
I would like to perform an operation in tidyverse/dplyr style so that I can filter out any rows that are from the states GA and CA. Notice that there is always a ", " (a comma followed by a space) before the state abbreviation.
The resulting dataframe should look like:
+-----------------+--------+
| Geography | Values |
+-----------------+--------+
| New York, NY | 30 |
+-----------------+--------+
| Denver, CO | 20 |
+-----------------+--------+
| Omaha, NE | 178 |
+-----------------+--------+
| Providence, RI | 100 |
+-----------------+--------+
| Little Rock, AR | 20 |
+-----------------+--------+
| Miami, FL | 50 |
+-----------------+--------+
| ... | |
+-----------------+--------+
The real data is much larger than this simple example. It consists of hundreds of cities, with multiple cities per state, so I cannot simply do something like:
data %>%
  filter(Geography == "Atlanta, GA" | Geography == "Los Angeles, CA")
Should I create a new "State" column that takes the last two letters of the "Geography" column and filter on that column, or can I do something regex-related, such as:
exclude_list = c("GA, CA")
data %>%
  filter(Geography != end_with(exclude_list))
What is an elegant way to do this? Thanks so much for your help!
You can construct exclude_list as:
exclude_list <- c("GA", "CA")
Then use subset():
subset(data, !grepl(sprintf('(%s)$',
                            paste0(exclude_list, collapse = '|')), Geography))
Or, if you want a dplyr answer:
library(dplyr)
data %>%
  filter(!grepl(sprintf('(%s)$',
                        paste0(exclude_list, collapse = '|')), Geography))
where
sprintf('(%s)$', paste0(exclude_list, collapse = '|'))  # returns
# [1] "(GA|CA)$"
If exclude_list is very long, the regex approach might fail; in that case the suggestion by @thelatemail would be helpful, where we keep only the state abbreviation and match it with %in%:
data[!sub("^.+,\\s+", "", data$Geography) %in% exclude_list, ]
I would recommend doing this with regex. The $ in a regular expression indicates the end of the line. So grepl(" CA$", Geography) will return TRUE if the geography ends with a space followed by the letters CA.
Hence I would do something like:
data %>%
  filter(!grepl(" CA$", Geography),
         !grepl(" GA$", Geography))
A data.table option with grepl
> setDT(df)[!grepl(",\\s+(CA|GA)$", Geography)]
         Geography Values
1:    New York, NY     30
2:      Denver, CO     20
3:       Omaha, NE    178
4:  Providence, RI    100
5: Little Rock, AR     20
6:       Miami, FL     50
or subset() if you are using base R
> subset(df, !grepl(",\\s+(CA|GA)$", Geography))
        Geography Values
2    New York, NY     30
3      Denver, CO     20
4       Omaha, NE    178
6  Providence, RI    100
7 Little Rock, AR     20
8       Miami, FL     50
Data
> dput(df)
structure(list(Geography = c("Atlanta, GA", "New York, NY", "Denver, CO",
"Omaha, NE", "Los Angeles, CA", "Providence, RI", "Little Rock, AR",
"Miami, FL"), Values = c(78L, 30L, 20L, 178L, 58L, 100L, 20L,
50L)), class = c("data.table", "data.frame"), row.names = c(NA,
-8L))

Grouping by column and finding preceding value of another column

I have a very long sales data set; below is an exemplary excerpt:
+------------+----------+----------+--------+--------+
| Date       | CountryA | CountryB | PriceA | PriceB |
+------------+----------+----------+--------+--------+
| 05/09/2019 | US       | Japan    | 20     | 55     |
| 28/09/2019 | Japan    | Germany  | 30     | 28     |
| 16/10/2019 | Canada   | US       | 25     | 78     |
| 28/10/2019 | Germany  | Japan    | 60     | 17     |
+------------+----------+----------+--------+--------+
I would like to group on column "CountryB" and then generate a new column holding the preceding value of PriceA for that country, i.e. from the last time that country appeared in column "CountryA", based on date order. For this exemplary table, I want the following result:
+------------+----------+----------+--------+--------+-------------+
| Date       | CountryA | CountryB | PriceA | PriceB | PriceA_lag1 |
+------------+----------+----------+--------+--------+-------------+
| 05/09/2019 | US       | Japan    | 20     | 55     |             |
| 28/09/2019 | Japan    | Germany  | 30     | 28     |             |
| 16/10/2019 | Canada   | US       | 25     | 78     | 20          |
| 28/10/2019 | Germany  | Japan    | 60     | 17     | 30          |
+------------+----------+----------+--------+--------+-------------+
I have tried the following with dplyr:
data <- data %>%
  group_by(CountryB) %>%
  mutate_at(.vars = vars(PriceA),
            .funs = list(lag1 = ~ dplyr::lag(., 1, order_by = Date)))
However this does not give me the preceding value when the respective country is in column "CountryA", but rather when the respective country is in "CountryB".
Can someone please help me out on this one?
Thanks.
Quite possibly some of the ugliest code I've written, but...
# install.packages(c('dplyr', 'magrittr'))
library(dplyr)
library(magrittr)

d <- data.frame(
  stringsAsFactors = FALSE,
  Date = c("05/09/2019", "28/09/2019", "16/10/2019", "28/10/2019"),
  CountryA = c("US", "Japan", "Canada", "Germany"),
  CountryB = c("Japan", "Germany", "US", "Japan"),
  PriceA = c(20L, 30L, 25L, 60L),
  PriceB = c(55L, 28L, 78L, 17L)
) %>%
  mutate(Date = as.Date(Date, format = '%d/%m/%Y'))

priceA_lag <- c()
for (row in 1:nrow(d)) {
  country <- slice(d, row) %$% CountryB
  date <- slice(d, row) %$% Date
  # most recent earlier row where this row's CountryB appeared as CountryA
  thePrice <- d %>%
    filter(CountryA == country,
           date > Date) %>%
    filter(Date == max(Date)) %$%
    PriceA
  thePrice <- ifelse(length(thePrice) > 0, thePrice, NA)
  priceA_lag <- priceA_lag %>%
    append(thePrice)
}
d$priceA_lag <- priceA_lag
> d
        Date CountryA CountryB PriceA PriceB priceA_lag
1 2019-09-05       US    Japan     20     55         NA
2 2019-09-28    Japan  Germany     30     28         NA
3 2019-10-16   Canada       US     25     78         20
4 2019-10-28  Germany    Japan     60     17         30
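If dplyr 1.1.0 or later is available, the loop can be replaced by a non-equi self-join: join_by() with closest() picks, for each row, the latest earlier date on which that row's CountryB appeared as CountryA. A sketch, assuming the d built above (LagDate and PriceA_lag1 are names introduced here):
library(dplyr)  # >= 1.1.0 for join_by() and closest()

d %>%
  left_join(d %>% transmute(CountryB = CountryA, LagDate = Date, PriceA_lag1 = PriceA),
            by = join_by(CountryB, closest(Date > LagDate))) %>%
  select(-LagDate)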

Remove data from DF based on multiple criteria

I have a large data frame (df) that looks something like the following sample. There are a number of data entry errors in the data set that I need to remove. In the sample data, all NSW rows should have a Postcode starting with 2, and all VIC rows should have a Postcode starting with 3.
| Suburb | State | Postcode |
| ------ | ----- | -------- |
| FLEMINGTON | NSW | 2140 |
| FLEMINGTON | NSW | 2144 |
| FLEMINGTON | NSW | 3996 |
| FLEMINGTON | VIC | 2996 |
| FLEMINGTON | VIC | 3021 |
| FLEMINGTON | VIC | 3031 |
I need the final table to look like...
| Suburb | State | Postcode |
| ------ | ----- | -------- |
| FLEMINGTON | NSW | 2140 |
| FLEMINGTON | NSW | 2144 |
| FLEMINGTON | VIC | 3021 |
| FLEMINGTON | VIC | 3031 |
The following solution is kind of close, but I don't know how to filter for integers starting with a specific digit, and I am under time pressure.
Extracting rows from df based on multiple conditions in R
Any help would be greatly appreciated.
To make this easy to extend later, do it as a merge operation against only your acceptable values for each state:
merge(
  transform(dat, Pc1 = substr(Postcode, 1, 1)),
  data.frame(State = c("NSW", "VIC"), Pc1 = c("2", "3"))
)
#  State Pc1     Suburb Postcode
#1   NSW   2 FLEMINGTON     2140
#2   NSW   2 FLEMINGTON     2144
#3   VIC   3 FLEMINGTON     3021
#4   VIC   3 FLEMINGTON     3031
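The same allow-list idea translates to dplyr with semi_join(), which keeps only the rows of dat that have a match in the lookup (a sketch; allowed is a name introduced here):
library(dplyr)

allowed <- data.frame(State = c("NSW", "VIC"), Pc1 = c("2", "3"))
dat %>%
  mutate(Pc1 = substr(Postcode, 1, 1)) %>%   # first digit of the postcode
  semi_join(allowed, by = c("State", "Pc1")) %>%
  select(-Pc1)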
Try this? If your Postcodes are integers and these are the only conditions, it should be pretty straightforward:
library(dplyr)

df <- data.frame(Suburb = rep("FLEMINGTON", 6),
                 State = c(rep("NSW", 3), rep("VIC", 3)),
                 Postcode = c(2140, 2144, 3996, 2996, 3021, 3031))

df <- df %>%
  filter((State == "NSW" & Postcode < 3000) |
         (State == "VIC" & Postcode >= 3000))
> df
      Suburb State Postcode
1 FLEMINGTON   NSW     2140
2 FLEMINGTON   NSW     2144
3 FLEMINGTON   VIC     3021
4 FLEMINGTON   VIC     3031
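If there were many states, the state-to-first-digit pairs could also live in a named vector, indexed by State inside filter(); a sketch over the original df (expected_first is a name introduced here):
library(dplyr)

expected_first <- c(NSW = "2", VIC = "3")
df %>%
  filter(substr(Postcode, 1, 1) == expected_first[State])  # keep rows whose first digit matches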

How to sort unique values based on another column in R

I would like to extract unique values based on the sum in another column. For example, I have the following data frame "music":
ID   | Song          | artist    | revenue
7520 | Dance with me | R Kelly   | 2000
7531 | Gone girl     | Vincent   | 1890
8193 | Motivation    | R Kelly   | 3500
9800 | What          | Beyonce   | 12000
2010 | Excuse Me     | Pharell   | 1010
1999 | Remove me     | Jack Will | 500
Basically, I would like to get the top five artists by revenue, without duplicate entries for a given artist.
You just need order() to do this. For instance:
head(unique(music$artist[order(music$revenue, decreasing=TRUE)]))
or, to retain all columns (although uniqueness of artists would be a little trickier):
head(music[order(music$revenue, decreasing=TRUE),])
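For the all-columns case, one base-R way to enforce uniqueness is duplicated() on the revenue-sorted frame, which keeps each artist's single highest-revenue row (a sketch; unlike the dplyr answer below, it does not sum revenue per artist):
sorted <- music[order(music$revenue, decreasing = TRUE), ]
head(sorted[!duplicated(sorted$artist), ], 5)  # top 5 artists, one row each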
Here's the dplyr way:
df <- read.table(text = "
ID | Song | artist | revenue
7520 | Dance with me | R Kelly | 2000
7531 | Gone girl | Vincent | 1890
8193 | Motivation | R Kelly | 3500
9800 | What | Beyonce | 12000
2010 | Excuse Me | Pharell | 1010
1999 | Remove me | Jack Will | 500
", header = TRUE, sep = "|", strip.white = TRUE)
You can group_by the artist, and then you can choose how many entries you want to peek at (here just 3):
require(dplyr)
df %>% group_by(artist) %>%
summarise(tot = sum(revenue)) %>%
arrange(desc(tot)) %>%
head(3)
Result:
Source: local data frame [3 x 2]

   artist   tot
1 Beyonce 12000
2 R Kelly  5500
3 Vincent  1890
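With dplyr >= 1.0, the arrange()/head() pair can be collapsed into slice_max() (a sketch on the same summarised data):
df %>%
  group_by(artist) %>%
  summarise(tot = sum(revenue)) %>%
  slice_max(tot, n = 5)   # top 5 artists by total revenue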
