Remove data from DF based on multiple criteria - r

I have a large data frame (df) that looks something like the following sample. There are a number of data entry errors in the data set and I need to remove these. In the sample data all NSW States should have a Postcode starting with 2. All VIC States should have a Postcode starting with 3.
| Suburb | State | Postcode |
| ------ | ----- | -------- |
| FLEMINGTON | NSW | 2140 |
| FLEMINGTON | NSW | 2144 |
| FLEMINGTON | NSW | 3996 |
| FLEMINGTON | VIC | 2996 |
| FLEMINGTON | VIC | 3021 |
| FLEMINGTON | VIC | 3031 |
I need the final table to look like...
| Suburb | State | Postcode |
| ------ | ----- | -------- |
| FLEMINGTON | NSW | 2140 |
| FLEMINGTON | NSW | 2144 |
| FLEMINGTON | VIC | 3021 |
| FLEMINGTON | VIC | 3031 |
The following question is close, but I don't know how to filter for integers that start with a specific digit, and I'm under time pressure.
Extracting rows from df based on multiple conditions in R
Any help would be greatly appreciated.

To make this easily extended on, do it as a merge operation against only your acceptable values for each state:
merge(
  transform(dat, Pc1 = substr(Postcode, 1, 1)),
  data.frame(State = c("NSW", "VIC"), Pc1 = c("2", "3"))
)
#   State Pc1     Suburb Postcode
# 1   NSW   2 FLEMINGTON     2140
# 2   NSW   2 FLEMINGTON     2144
# 3   VIC   3 FLEMINGTON     3021
# 4   VIC   3 FLEMINGTON     3031
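To see how this extends, here is a sketch with a hypothetical extra state added to the lookup (QLD and its leading digit "4" are assumptions for illustration, not from the question):

```r
# Lookup of the acceptable leading digit per state; the QLD row is
# a hypothetical addition to show that the approach scales.
lookup <- data.frame(State = c("NSW", "VIC", "QLD"),
                     Pc1   = c("2", "3", "4"))
dat <- data.frame(Suburb   = "FLEMINGTON",
                  State    = c("NSW", "NSW", "VIC", "QLD"),
                  Postcode = c(2140, 3996, 3021, 4000))
# merge() keeps only rows whose (State, Pc1) pair appears in the lookup,
# so the mismatched NSW/3996 row drops out.
res <- merge(transform(dat, Pc1 = substr(Postcode, 1, 1)), lookup)
```

Adding a state is then just one more row in `lookup`, with no change to the merge itself.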

Try this. If your Postcode values are integers and these are the only conditions, it should be pretty straightforward:
df <- data.frame(Suburb = rep("FLEMINGTON", 6),
                 State = c(rep("NSW", 3), rep("VIC", 3)),
                 Postcode = c(2140, 2144, 3996, 2996, 3021, 3031))

library(dplyr)
df <- df %>%
  filter((State == "NSW" & Postcode < 3000) | (State == "VIC" & Postcode >= 3000))
> df
      Suburb State Postcode
1 FLEMINGTON   NSW     2140
2 FLEMINGTON   NSW     2144
3 FLEMINGTON   VIC     3021
4 FLEMINGTON   VIC     3031
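If the postcodes might arrive as character strings (e.g. with leading zeros), comparing the first character with `substr()` is a safer sketch than numeric cutoffs; this base-R variant also avoids the dplyr dependency:

```r
# Same sample data as above; substr() coerces Postcode to character,
# so this works whether the column is numeric or character.
df <- data.frame(Suburb = rep("FLEMINGTON", 6),
                 State = c(rep("NSW", 3), rep("VIC", 3)),
                 Postcode = c(2140, 2144, 3996, 2996, 3021, 3031))
clean <- subset(df, (State == "NSW" & substr(Postcode, 1, 1) == "2") |
                    (State == "VIC" & substr(Postcode, 1, 1) == "3"))
```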

Related

R Studio: Match first n characters between two columns, and fill in value from another column

I have a dataframe "city_table" that looks like this:
| | city |
| --- | ------------------- |
| 1 | Chicago-2234dxsw |
| 2 | Chicago,IL |
| 3 | Chicago |
| 4 | Chicago - 124421xsd |
| 5 | Chicago_2133xx |
| 6 | Atlanta- 1234xx |
| 7 | Atlanta, GA |
| 8 | Atlanta - 123456T |
I have another city code lookup table "city_lookup" that looks like this:
| | city_name | city_code |
| --- | ----------- | --------- |
| 1 | Chicago, IL | 001 |
| 2 | Atlanta, GA | 002 |
As you can see, the city names in city_table$city are messy and inconsistently formatted, whereas the city names in city_lookup$city_name follow a unified format (City, STATE).
I would like a final table that, by matching the first n characters (let's say n = 7) of city_table$city against city_lookup$city_name, returns the city code properly, something like this:
| | city_name | city_code |
| --- | ------------------- | --------- |
| 1 | Chicago-2234dxsw | 001 |
| 2 | Chicago,IL | 001 |
| 3 | Chicago | 001 |
| 4 | Chicago - 124421xsd | 001 |
| 5 | Chicago_2133xx | 001 |
| 6 | Atlanta- 1234xx | 002 |
| 7 | Atlanta, GA | 002 |
| 8 | Atlanta - 123456T | 002 |
I am doing this in R, preferably using tidyverse/dplyr. Thanks so much for your help!
Even better, as long as the characters after the full city name are always non-letters, you can match the entire city name, like so:
library(dplyr)

city_table <- tibble(city = c("Chicago-2234dxsw", "Chicago,IL", "Atlanta - 123456T"))
city_lookup <- tibble(city_name = c("Chicago, IL", "Atlanta, GA"),
                      city_code = c("001", "002"))

city_table %>%
  mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city)) %>%
  left_join(city_lookup %>%
              mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city_name)),
            by = "city_clean") %>%
  select(-city_clean, -city_name)
city city_code
<chr> <chr>
1 Chicago-2234dxsw 001
2 Chicago,IL 001
3 Atlanta - 123456T 002
We can create columns with substring (as the OP asked in the question) and then do a regex_left_join:
library(dplyr)
library(fuzzyjoin)

city_table %>%
  mutate(city_sub = substring(city, 1, 7)) %>%
  regex_left_join(city_lookup %>%
                    mutate(city_sub = substring(city_name, 1, 7)),
                  by = 'city_sub') %>%
  select(city_name = city, city_code)
Output:
# city_name city_code
#1 Chicago-2234dxsw 001
#2 Chicago,IL 001
#3 Chicago 001
#4 Chicago - 124421xsd 001
#5 Chicago_2133xx 001
#6 Atlanta- 1234xx 002
#7 Atlanta, GA 002
#8 Atlanta - 123456T 002
Data:
city_table <- structure(list(city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago",
  "Chicago - 124421xsd", "Chicago_2133xx", "Atlanta- 1234xx", "Atlanta, GA",
  "Atlanta - 123456T")), class = "data.frame", row.names = c(NA, -8L))

city_lookup <- structure(list(city_name = c("Chicago, IL", "Atlanta, GA"),
  city_code = c("001", "002")), class = "data.frame", row.names = c(NA, -2L))

Creating arrays based on multiple criteria, preferably in VBA, but R is also an option

I am new to Stack Overflow, and I am trying to sort/create arrays so that I can work with them to get the mean and standard deviation.
I have a data set of almost 50,000 observations.
An example of the dataset is shown below.
| Person | Product | Date | Price |
|----------|---------|------------|-------|
| Chris | Pear | 01-02-2018 | 10 |
| Tom | Pear | 02-02-2018 | 11 |
| John | Pear | 03-02-2018 | 12 |
| Bill | Pear | 04-02-2018 | 13 |
| Someone | Pear | 05-02-2018 | 14 |
| Chris | Pear | 06-02-2018 | 15 |
| Tom2 | Apples | 07-02-2018 | 16 |
| John | Pear | 08-02-2018 | 17 |
| Bill2 | Pear | 09-02-2018 | 18 |
| Someone2 | Pear | 10-02-2018 | 19 |
Mean price: 14.5
STD: 3.028
What I want is an array (for each of the prices) telling me the mean price and std at the current date. That requires keeping only the most recent observation per Person and Product.
So I would end up with something like this (for Pears) at the date 10-02-2018:
| Person | Product | Date | Price |
| ------ | ------- | ---------- | ----- |
| Tom | Pear | 02-02-2018 | 11 |
| Bill | Pear | 04-02-2018 | 13 |
| Someone | Pear | 05-02-2018 | 14 |
| Chris | Pear | 06-02-2018 | 15 |
| John | Pear | 08-02-2018 | 17 |
| Bill2 | Pear | 09-02-2018 | 18 |
| Someone2 | Pear | 10-02-2018 | 19 |
Mean price: 15.29
Std: 2.87
Hope that someone is able to help out!
Many thanks in advance.
Reproducing your data:
dat <- read.table(
  text = gsub(
    "[[:punct:]]", "",
    "| Person | Product | Date | Price |
|----------|---------|------------|-------|
| Chris | Pear | 01-02-2018 | 10 |
| Tom | Pear | 02-02-2018 | 11 |
| John | Pear | 03-02-2018 | 12 |
| Bill | Pear | 04-02-2018 | 13 |
| Someone | Pear | 05-02-2018 | 14 |
| Chris | Pear | 06-02-2018 | 15 |
| Tom2 | Apples | 07-02-2018 | 16 |
| John | Pear | 08-02-2018 | 17 |
| Bill2 | Pear | 09-02-2018 | 18 |
| Someone2 | Pear | 10-02-2018 | 19 |"
  ),
  header = TRUE, colClasses = c(rep("character", 3), "integer")
)
Cleaning the resulting table:
library(tidyverse)
library(magrittr)
dat %<>%
  mutate_if(is.character, ~ gsub("\\s+", "", .x)) %>%  # funs() is deprecated
  mutate(Date = as.Date(Date, "%d%m%Y"))
Answering your question:
dat %>%                                    # replace %>% with %<>% to save changes
  group_by(Person, Product) %>%            # group by Person and Product
  filter(Date == max(Date)) %>%            # keep only the most recent records
  ungroup() %>%                            # ungroup data
  arrange(Product, Date) %>%               # sort by Product, Date
  mutate(Date = format(Date, "%d-%m-%Y"))  # output date as in desired output
# A tibble: 8 x 4
Person Product Date Price
<chr> <chr> <chr> <int>
1 Tom2 Apples 07-02-2018 16
2 Tom Pear 02-02-2018 11
3 Bill Pear 04-02-2018 13
4 Someone Pear 05-02-2018 14
5 Chris Pear 06-02-2018 15
6 John Pear 08-02-2018 17
7 Bill2 Pear 09-02-2018 18
8 Someone2 Pear 10-02-2018 19
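If you'd rather stay in base R, the same "latest record per Person/Product" step can be sketched with `ave()` (the four rows below are an assumed, trimmed subset of the question's data, just to keep the example short):

```r
# Flag each Person/Product group's maximum date, then subset on the flag.
dat <- data.frame(
  Person  = c("Chris", "Tom", "John", "Chris"),
  Product = c("Pear", "Pear", "Pear", "Pear"),
  Date    = as.Date(c("2018-02-01", "2018-02-02", "2018-02-03", "2018-02-06")),
  Price   = c(10, 11, 12, 15))
latest <- dat[as.logical(ave(as.numeric(dat$Date),
                             dat$Person, dat$Product,
                             FUN = function(d) d == max(d))), ]
```

Chris's 01-02-2018 row drops out because his 06-02-2018 row is more recent; `mean(latest$Price)` and `sd(latest$Price)` then give the statistics over current observations only.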

Turn a column in a table into merged rows in R, with integer addition

I have this data table which shows incoming calls at a grocery shop.
Call_time is when the call came in; activity_time is when the employee started using the software; activity_des is the description of the activity performed; Call_end is when the call finished; and activity_duration is the duration of each activity.
| Date | Call_time | activity_time | activity_des | Call_end | activity_duration |
| ---- | --------- | ------------- | ------------ | -------- | ----------------- |
| 2017-05-03 | 08:05:53 | 08:06:03 | Online shop | 08:07:03 | 30 |
| 2017-05-03 | 08:07:30 | 08:08:00 | Transfer | 08:10:00 | 25 |
| 2017-05-03 | 08:07:30 | 08:08:25 | buy | 08:10:00 | 35 |
| 2017-05-03 | 08:07:30 | 08:09:00 | receipt | 08:10:00 | 60 |
| 2017-05-04 | 14:34:10 | 14:40:00 | question | 14:41:47 | 66 |
| 2017-05-04 | 14:34:10 | 14:41:06 | question | 14:41:47 | 39 |
| ...... | ..... | ..... | ..... | ..... | .. |
Desired output
| Date | Call_time | activity_des | Call_end | activities_duration |
| ---- | --------- | ------------ | -------- | ------------------- |
| 2017-05-03 | 08:05:53 | Online shop | 08:07:03 | 30 |
| 2017-05-03 | 08:07:30 | Transfer,buy,receipt | 08:10:00 | 120 |
| 2017-05-04 | 14:34:10 | question | 14:41:47 | 105 |
| ...... | ..... | ..... | ..... | .. |
So: remove activity_time since we don't need it, merge the different activity_des values for the same call together, and add the activity_duration of the merged rows into one value.
Also, when the same activity occurs twice in a row (as with question), I don't need it shown twice after merging; just add the duration times.
Thank you
Using the tidyverse:
library(tidyverse)

activity %>%
  select(-activity_time) %>%
  group_by(Date, Call_time, Call_end) %>%
  summarize(activity_des = paste(unique(activity_des), collapse = ", "),
            activity_duration = sum(activity_duration))
# # A tibble: 3 x 5
# # Groups: Date, Call_time [?]
#   Date       Call_time Call_end activity_des           activity_duration
#   <chr>      <chr>     <chr>    <chr>                              <dbl>
# 1 2017-05-03 08:05:53  08:07:03 Online shop                           30
# 2 2017-05-03 08:07:30  08:10:00 Transfer, buy, receipt               120
# 3 2017-05-04 14:34:10  14:41:47 question                             105
Wrapping activity_des in unique() collapses the repeated question entries into one, as requested, while their durations are still summed.
Data:
activity <- read.table(header = TRUE, stringsAsFactors = FALSE, sep = "|", text = "
Date | Call_time | activity_time | activity_des| Call_end | activity_duration
2017-05-03 | 08:05:53 | 08:06:03 | Online shop | 08:07:03 | 30
2017-05-03 | 08:07:30 | 08:08:00 | Transfer | 08:10:00 | 25
2017-05-03 | 08:07:30 | 08:08:25 | buy | 08:10:00 | 35
2017-05-03 | 08:07:30 | 08:09:00 | receipt | 08:10:00 | 60
2017-05-04 | 14:34:10 | 14:40:00 | question | 14:41:47 | 66
2017-05-04 | 14:34:10 | 14:41:06 | question | 14:41:47 | 39")

activity[] <- lapply(activity, trimws)
activity$activity_duration <- as.numeric(activity$activity_duration)
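The same collapse can be sketched in base R with `aggregate()`, no packages required (the data frame here is an assumed, already-cleaned subset of the question's first day of calls):

```r
# Collapse descriptions and sum durations per (Date, Call_time, Call_end),
# then merge the two aggregates back together.
activity <- data.frame(
  Date = c("2017-05-03", "2017-05-03", "2017-05-03", "2017-05-03"),
  Call_time = c("08:05:53", "08:07:30", "08:07:30", "08:07:30"),
  activity_des = c("Online shop", "Transfer", "buy", "receipt"),
  Call_end = c("08:07:03", "08:10:00", "08:10:00", "08:10:00"),
  activity_duration = c(30, 25, 35, 60),
  stringsAsFactors = FALSE)
des <- aggregate(activity_des ~ Date + Call_time + Call_end, activity,
                 function(x) paste(unique(x), collapse = ", "))
dur <- aggregate(activity_duration ~ Date + Call_time + Call_end, activity, sum)
res <- merge(des, dur)
```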

Populating column based on row matches without for loop

Is there a way to obtain the annual count values based on the state, species, and year, without using a for loop?
| Name | State | Age | Species | Annual Ct |
| ---- | ----- | --- | ------- | --------- |
| Nemo | NY | 5 | Clownfish | ? |
| Dora | CA | 2 | Regal Tang | ? |
Lookup table:
| State | Species | Year | AnnualCt |
| ----- | ------- | ---- | -------- |
| NY | Clownfish | 2012 | 500 |
| NY | Clownfish | 2014 | 200 |
| CA | Regal Tang | 2001 | 400 |
| CA | Regal Tang | 2014 | 680 |
| CA | Regal Tang | 2000 | 700 |
The output would be:
| Name | State | Age | Species | Annual Ct |
| ---- | ----- | --- | ------- | --------- |
| Nemo | NY | 5 | Clownfish | 200 |
| Dora | CA | 2 | Regal Tang | 680 |
What I've tried:
pets <- data.frame(Name = c("Nemo", "Dora"),
                   State = c("NY", "CA"),
                   Age = c(5, 2),
                   Species = c("Clownfish", "Regal Tang"))
fishes <- data.frame(State = c("NY", "NY", "CA", "CA", "CA"),
                     Species = c("Clownfish", "Clownfish", "Regal Tang",
                                 "Regal Tang", "Regal Tang"),
                     Year = c("2012", "2014", "2001", "2014", "2000"),
                     AnnualCt = c("500", "200", "400", "680", "700"))

pets["AnnualCt"] <- NA
for (row in 1:nrow(pets)) {
  pets$AnnualCt[row] <- as.character(droplevels(
    fishes[which(fishes$State == pets[row, ]$State &
                 fishes$Species == pets[row, ]$Species &
                 fishes$Year == 2014),
           which(colnames(fishes) == "AnnualCt")]))
}
I'm confused as to what you're trying to do; isn't this just this?
library(dplyr)
left_join(pets, fishes) %>%
  filter(Year == 2014) %>%
  select(-Year)
#Joining, by = c("State", "Species")
# Name State Age Species AnnualCt
#1 Nemo NY 5 Clownfish 200
#2 Dora CA 2 Regal Tang 680
Explanation: left_join both data.frames by State and Species, filter for Year == 2014 and output without Year column.
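As a base-R sketch of the same idea (using only `merge()`, no packages), you can filter fishes down to 2014 before joining, which also avoids carrying the Year column through the join:

```r
# Restrict the lookup to the year of interest first, then merge on the
# shared State and Species columns.
pets <- data.frame(Name = c("Nemo", "Dora"), State = c("NY", "CA"),
                   Age = c(5, 2), Species = c("Clownfish", "Regal Tang"))
fishes <- data.frame(State = c("NY", "NY", "CA", "CA", "CA"),
                     Species = c("Clownfish", "Clownfish", "Regal Tang",
                                 "Regal Tang", "Regal Tang"),
                     Year = c("2012", "2014", "2001", "2014", "2000"),
                     AnnualCt = c("500", "200", "400", "680", "700"))
res <- merge(pets, fishes[fishes$Year == "2014", c("State", "Species", "AnnualCt")])
```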

How to sort unique values based on another column in R

I would like to extract unique values based on the sum in another column. For example, I have the following data frame "music":
| ID | Song | artist | revenue |
| -- | ---- | ------ | ------- |
| 7520 | Dance with me | R kelly | 2000 |
| 7531 | Gone girl | Vincent | 1890 |
| 8193 | Motivation | R Kelly | 3500 |
| 9800 | What | Beyonce | 12000 |
| 2010 | Excuse Me | Pharell | 1010 |
| 1999 | Remove me | Jack Will | 500 |
Basically, I would like to get the top 5 artists by revenue, without duplicate entries for a given artist.
You just need order() to do this. For instance:
head(unique(music$artist[order(music$revenue, decreasing = TRUE)]), 5)
or, to retain all columns (although keeping artists unique would be a little trickier):
head(music[order(music$revenue, decreasing = TRUE), ], 5)
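Keeping all columns while still de-duplicating artists can be sketched in base R with `duplicated()`; note this keeps each artist's single highest-revenue row rather than summing revenue per artist (the data frame below reconstructs the question's table):

```r
# Reconstruction of the question's "music" data frame.
music <- data.frame(
  ID      = c(7520, 7531, 8193, 9800, 2010, 1999),
  Song    = c("Dance with me", "Gone girl", "Motivation",
              "What", "Excuse Me", "Remove me"),
  artist  = c("R Kelly", "Vincent", "R Kelly", "Beyonce",
              "Pharell", "Jack Will"),
  revenue = c(2000, 1890, 3500, 12000, 1010, 500),
  stringsAsFactors = FALSE)
# Sort by revenue, drop repeat artists, keep the top 5 rows.
ord <- music[order(music$revenue, decreasing = TRUE), ]
top <- head(ord[!duplicated(ord$artist), ], 5)
```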
Here's the dplyr way:
df <- read.table(text = "
ID | Song | artist | revenue
7520 | Dance with me | R Kelly | 2000
7531 | Gone girl | Vincent | 1890
8193 | Motivation | R Kelly | 3500
9800 | What | Beyonce | 12000
2010 | Excuse Me | Pharell | 1010
1999 | Remove me | Jack Will | 500
", header = TRUE, sep = "|", strip.white = TRUE)
You can group_by the artist, and then you can choose how many entries you want to peek at (here just 3):
require(dplyr)
df %>%
  group_by(artist) %>%
  summarise(tot = sum(revenue)) %>%
  arrange(desc(tot)) %>%
  head(3)
Result:
Source: local data frame [3 x 2]
artist tot
1 Beyonce 12000
2 R Kelly 5500
3 Vincent 1890
