Grouping by column and finding preceding value of another column - r

I have a very long sales dataset; below is an exemplary excerpt:
+------------+----------+----------+--------+--------+
| Date       | CountryA | CountryB | PriceA | PriceB |
+------------+----------+----------+--------+--------+
| 05/09/2019 | US       | Japan    | 20     | 55     |
| 28/09/2019 | Japan    | Germany  | 30     | 28     |
| 16/10/2019 | Canada   | US       | 25     | 78     |
| 28/10/2019 | Germany  | Japan    | 60     | 17     |
+------------+----------+----------+--------+--------+
I would like to group on column "CountryB" and then generate a new column which displays the preceding value of PriceA of that respective country, i.e. when that specific country was present in column "CountryA" the last time based on date order. In this exemplary table, I want to get the following results:
+------------+----------+----------+--------+--------+-------------+
| Date       | CountryA | CountryB | PriceA | PriceB | PriceA_lag1 |
+------------+----------+----------+--------+--------+-------------+
| 05/09/2019 | US       | Japan    | 20     | 55     |             |
| 28/09/2019 | Japan    | Germany  | 30     | 28     |             |
| 16/10/2019 | Canada   | US       | 25     | 78     | 20          |
| 28/10/2019 | Germany  | Japan    | 60     | 17     | 30          |
+------------+----------+----------+--------+--------+-------------+
I have tried the following with dplyr:
data = data %>%
  group_by(CountryB) %>%
  mutate_at(list(lag1 = ~ dplyr::lag(., 1, order_by = Date)),
            .vars = vars(PriceA))
However, this does not give me the preceding value from when the respective country was in column "CountryA", but rather from when it was in "CountryB".
Can someone please help me out on this one?
Thanks.

Quite possibly some of the ugliest code I've written, but...
# install.packages(c('dplyr', 'magrittr'))
library(dplyr)
library(magrittr)

d <- data.frame(
  stringsAsFactors = FALSE,
  Date     = c("05/09/2019", "28/09/2019", "16/10/2019", "28/10/2019"),
  CountryA = c("US", "Japan", "Canada", "Germany"),
  CountryB = c("Japan", "Germany", "US", "Japan"),
  PriceA   = c(20L, 30L, 25L, 60L),
  PriceB   = c(55L, 28L, 78L, 17L)
) %>%
  mutate(Date = as.Date(Date, format = '%d/%m/%Y'))

priceA_lag <- c()
for (row in 1:nrow(d)) {
  country <- slice(d, row) %$% CountryB
  date    <- slice(d, row) %$% Date

  # most recent earlier sale where this row's CountryB appeared in CountryA
  thePrice <- d %>%
    filter(CountryA == country,
           date > Date) %>%
    filter(Date == max(Date)) %$%
    PriceA

  thePrice <- ifelse(length(thePrice) > 0, thePrice, NA)
  priceA_lag <- priceA_lag %>%
    append(thePrice)
}
d$priceA_lag <- priceA_lag
> d
Date CountryA CountryB PriceA PriceB priceA_lag
1 2019-09-05 US Japan 20 55 NA
2 2019-09-28 Japan Germany 30 28 NA
3 2019-10-16 Canada US 25 78 20
4 2019-10-28 Germany Japan 60 17 30
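For longer data, a set-based lookup avoids growing a vector inside a loop. Below is a minimal sketch of a data.table rolling-join alternative (my own suggestion, not part of the answer above; past and prev_date are helper names introduced here). For each row it fetches the most recent strictly earlier date on which the row's CountryB appeared in CountryA:
# Sketch only: rolling-join alternative, assuming data.table is acceptable here.
library(data.table)

dt <- as.data.table(d)

# Past sales keyed by the country that appeared in CountryA
past <- dt[, .(country = CountryA, sale_date = Date, PriceA_lag1 = PriceA)]

# Join on country and roll the date forward (LOCF); prev_date = Date - 1
# keeps the matched sale strictly before the current row's date.
dt[, prev_date := Date - 1L]
dt[, PriceA_lag1 := past[dt,
                         on = c(country = "CountryB", sale_date = "prev_date"),
                         roll = Inf, x.PriceA_lag1]]
dt[, prev_date := NULL]
On the sample data this reproduces the priceA_lag column above (NA, NA, 20, 30).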

Related

R Studio: Match first n characters between two columns, and fill in value from another column

I have a dataframe "city_table" that looks like this:
+---+---------------------+
|   | city                |
+---+---------------------+
| 1 | Chicago-2234dxsw    |
| 2 | Chicago,IL          |
| 3 | Chicago             |
| 4 | Chicago - 124421xsd |
| 5 | Chicago_2133xx      |
| 6 | Atlanta- 1234xx     |
| 7 | Atlanta, GA         |
| 8 | Atlanta - 123456T   |
+---+---------------------+
I have another city code lookup table "city_lookup" that looks like this:
+---+-------------+-----------+
|   | city_name   | city_code |
+---+-------------+-----------+
| 1 | Chicago, IL | 001       |
| 2 | Atlanta, GA | 002       |
+---+-------------+-----------+
As you can see, the city names in "city" are messy and inconsistently formatted, whereas the city names in "city_lookup" follow a unified format (City, STATE).
I would like a final table that, by matching the first n characters (let's say n = 7) of city_table$city against city_lookup$city_name, returns the proper city code, something like this:
+---+---------------------+-----------+
|   | city_name           | city_code |
+---+---------------------+-----------+
| 1 | Chicago-2234dxsw    | 001       |
| 2 | Chicago,IL          | 001       |
| 3 | Chicago             | 001       |
| 4 | Chicago - 124421xsd | 001       |
| 5 | Chicago_2133xx      | 001       |
| 6 | Atlanta- 1234xx     | 002       |
| 7 | Atlanta, GA         | 002       |
| 8 | Atlanta - 123456T   | 002       |
+---+---------------------+-----------+
I am doing this in R, preferably using tidyverse/dplyr. Thanks so much for your help!
Even better, as long as the characters after the full city name are always non-letters, you can match on the entire city name, like so:
library(dplyr)

city_table <- tibble(city = c("Chicago-2234dxsw", "Chicago,IL", "Atlanta - 123456T"))
city_lookup <- tibble(city_name = c("Chicago, IL", "Atlanta, GA"),
                      city_code = c("001", "002"))

city_table %>%
  mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city)) %>%
  left_join(city_lookup %>%
              mutate(city_clean = gsub("^([a-zA-Z]*).*", "\\1", city_name, perl = T)),
            by = "city_clean") %>%
  select(-city_clean, -city_name)
city city_code
<chr> <chr>
1 Chicago-2234dxsw 001
2 Chicago,IL 001
3 Atlanta - 123456T 002
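One caveat with the pattern ^([a-zA-Z]*).*: it keeps only the leading run of letters, so a two-word city such as "New York, NY" would be truncated at the space. A small sketch of an adjusted pattern (assuming the trailing junk never begins with a letter or a space followed by letters):
# Sketch: also allow internal spaces in the city name; anything after the
# name is assumed to start with a character that is neither a letter nor a space.
clean_city <- function(x) gsub("^([a-zA-Z]+( [a-zA-Z]+)*).*", "\\1", x)

clean_city(c("New York - 99xx", "Chicago,IL", "Atlanta- 1234xx"))
# [1] "New York" "Chicago"  "Atlanta"
The same city_clean join logic as above then applies unchanged.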
We can create columns with substring (as the OP asked in the question) and then do a regex_left_join
library(dplyr)
library(fuzzyjoin)
city_table %>%
  mutate(city_sub = substring(city, 1, 7)) %>%
  regex_left_join(city_lookup %>%
                    mutate(city_sub = substring(city_name, 1, 7)),
                  by = 'city_sub') %>%
  select(city_name = city, city_code)
-output
# city_name city_code
#1 Chicago-2234dxsw 001
#2 Chicago,IL 001
#3 Chicago 001
#4 Chicago - 124421xsd 001
#5 Chicago_2133xx 001
#6 Atlanta- 1234xx 002
#7 Atlanta, GA 002
#8 Atlanta - 123456T 002
data
city_table <- structure(list(city = c("Chicago-2234dxsw", "Chicago,IL", "Chicago",
"Chicago - 124421xsd", "Chicago_2133xx", "Atlanta- 1234xx", "Atlanta, GA",
"Atlanta - 123456T")), class = "data.frame", row.names = c(NA,
-8L))
city_lookup <- structure(list(city_name = c("Chicago, IL", "Atlanta, GA"),
city_code = c("001",
"002")), class = "data.frame", row.names = c(NA, -2L))

Using pivot_wider to get true or false [duplicate]

This question already has answers here:
Aggregate and reshape from long to wide
(2 answers)
Closed 2 years ago.
I'm trying to use pivot_wider to get a binary result for each country in each year between 1991 and 1995, like this table:
+------+-------+--------+--------+
| year | USA   | Israel | Sweden |
+------+-------+--------+--------+
| 1991 | FALSE | TRUE   | TRUE   |
| 1992 | FALSE | FALSE  | TRUE   |
| 1993 | FALSE | TRUE   | TRUE   |
| 1994 | FALSE | FALSE  | TRUE   |
| 1995 | TRUE  | TRUE   | TRUE   |
+------+-------+--------+--------+
Of course, any binary indicator other than TRUE/FALSE would also be fine.
However, my data frame looks like:
country = c("Sweden", "Sweden", "Sweden", "Sweden", "Sweden", "Israel", "Israel",
"Israel", "USA")
year = c(1991,1992,1993,1994,1995,1991,1993,1995,1995)
df = as.data.frame(cbind(year,country))
df
+---------+------+
| country | year |
+---------+------+
| Sweden  | 1991 |
| Sweden  | 1992 |
| Sweden  | 1993 |
| Sweden  | 1994 |
| Sweden  | 1995 |
| Israel  | 1991 |
| Israel  | 1993 |
| Israel  | 1995 |
| USA     | 1995 |
+---------+------+
I tried the following code and obtained the result below, which is not what I'm looking for:
library(dplyr)
library(tidyr)
df2 = df %>%
  group_by(country) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = country, values_from = year) %>%
  select(-row)
df2
+------+--------+--------+
| USA  | Israel | Sweden |
+------+--------+--------+
| 1995 | 1991   | 1991   |
| NA   | 1993   | 1992   |
| NA   | 1995   | 1993   |
| NA   | NA     | 1994   |
| NA   | NA     | 1995   |
+------+--------+--------+
You can try this:
library(dplyr)
library(tidyr)
df %>%
  mutate(val = 1) %>%
  pivot_wider(names_from = country, values_from = val) %>%
  mutate(across(-year, ~ replace_na(.x, 0))) %>%
  mutate(across(-year, ~ ifelse(.x == 1, TRUE, FALSE)))
Output:
# A tibble: 5 x 4
year Sweden Israel USA
<fct> <lgl> <lgl> <lgl>
1 1991 TRUE TRUE FALSE
2 1992 TRUE FALSE FALSE
3 1993 TRUE TRUE FALSE
4 1994 TRUE FALSE FALSE
5 1995 TRUE TRUE TRUE
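A slightly shorter variant is also possible; this is a sketch assuming a reasonably recent tidyr where values_fill accepts a plain scalar. It marks each observed (year, country) pair and lets pivot_wider fill the missing combinations with FALSE directly, so the replace_na/ifelse steps are not needed:
library(dplyr)
library(tidyr)

# Sketch: one TRUE per observed (year, country) pair; missing pairs become FALSE.
df %>%
  distinct(year, country) %>%
  mutate(present = TRUE) %>%
  pivot_wider(names_from = country, values_from = present, values_fill = FALSE)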
Here is a data.table solution:
library(data.table)
# custom function: returns TRUE when a vector has length > 0
cust_fun <- function(x) length(x) > 0
# cast to wide, aggregating with the custom function above
dcast(setDT(df), year ~ country, fun.aggregate = cust_fun)
# year Israel Sweden USA
# 1: 1991 TRUE TRUE FALSE
# 2: 1992 FALSE TRUE FALSE
# 3: 1993 TRUE TRUE FALSE
# 4: 1994 FALSE TRUE FALSE
# 5: 1995 TRUE TRUE TRUE

Split a row into columns with conditions in R

I have a dataframe as below:
+----+-------+---------+
| ID | VALUE | DATE |
+----+-------+---------+
| 1 | 10 | 2019-08 |
| 2 | 12 | 2018-05 |
| 3 | 45 | 2019-03 |
| 3 | 33 | 2018-03 |
| 1 | 5 | 2018-08 |
| 2 | 98 | 2019-05 |
| 4 | 67 | 2019-10 |
| 4 | 34 | 2018-10 |
| 1 | 55 | 2018-07 |
| 2 | 76 | 2019-08 |
| 2 | 56 | 2018-12 |
+----+-------+---------+
What I'm trying to do here is split VALUE and DATE into VALUE1/VALUE2 and DATE1/DATE2 based on the current year (the year of the system date) and the year before.
The condition is: if the year-month combination in DATE of the main table matches that of the current system date, then do not consider last year's date.
Also disregard all values and dates from last year that have no corresponding month in the current year.
The resulting output would be as below.
In the result, IDs 1, 2 and 3 had values for the same month in both this year and last year, so they are split across two columns.
We didn't consider last year's value for ID 4, because its month this year matches the year-month combination of the system date.
We also disregard all values from last year that don't have a corresponding month match this year (ID 1 for 2018-07 and ID 2 for 2018-12 in this example).
+----+---------+---------+--------+--------+
| ID | DATE1 | DATE2 | VALUE1 | VALUE2 |
+----+---------+---------+--------+--------+
| 1 | 2019-08 | 2018-08 | 10 | 5 |
| 2 | 2019-05 | 2018-05 | 98 | 12 |
| 3 | 2019-03 | 2018-03 | 45 | 33 |
| 4 | 2019-10 | NA | 67 | NA |
| 2 | 2019-08 | NA | 76 | NA |
+----+---------+---------+--------+--------+
I think you could get everything in the right format first:
df <- data.frame(ID = c(1, 2, 3, 3, 1, 2, 4, 4, 1, 2, 2),
                 VALUE = c(10, 12, 45, 33, 5, 98, 67, 34, 55, 76, 56),
                 DATE = c("2019-08", "2018-05", "2019-03", "2018-03",
                          "2018-08", "2019-05", "2019-10", "2018-10",
                          "2018-07", "2019-08", "2018-12"))

library(tidyverse)
df <- df %>%
  mutate(
    year = str_split_fixed(DATE, "-", 2)[, 1],
    month = str_split_fixed(DATE, "-", 2)[, 2]) %>%
  pivot_wider(
    names_from = year,
    values_from = c(VALUE, DATE))
Then, you could filter and remove those values that you do not need according to your logic. I may not fully understand your system time here, but just assume it is the string "2019-10". It could be something like this:
df %>%
  filter(!is.na(VALUE_2019)) %>%
  mutate(
    VALUE_2018 = ifelse(DATE_2019 == "2019-10", NA, VALUE_2018),
    DATE_2018 = ifelse(DATE_2019 == "2019-10", NA, as.character(DATE_2018)))
# A tibble: 5 x 6
ID month VALUE_2019 VALUE_2018 DATE_2019 DATE_2018
<dbl> <chr> <dbl> <dbl> <fct> <chr>
1 1 08 10 5 2019-08 2018-08
2 2 05 98 12 2019-05 2018-05
3 3 03 45 33 2019-03 2018-03
4 4 10 67 NA 2019-10 NA
5 2 08 76 NA 2019-08 NA
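If the hard-coded "2019-10" should instead come from the real system date, here is a small sketch (still assuming the data spans 2018-2019, so the VALUE_2019/DATE_2019 column names stay as above):
# Sketch: year-month of the system date, e.g. "2019-10" when run in October 2019
current_ym <- format(Sys.Date(), "%Y-%m")

df %>%
  filter(!is.na(VALUE_2019)) %>%
  mutate(
    VALUE_2018 = ifelse(DATE_2019 == current_ym, NA, VALUE_2018),
    DATE_2018 = ifelse(DATE_2019 == current_ym, NA, as.character(DATE_2018)))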

Populating column based on row matches without for loop

Is there a way to obtain the annual count values based on the state, species, and year, without using a for loop?
Name | State | Age | Species | Annual Ct
Nemo | NY | 5 | Clownfish | ?
Dora | CA | 2 | Regal Tang | ?
Lookup table:
State | Species | Year | AnnualCt
NY | Clownfish | 2012 | 500
NY | Clownfish | 2014 | 200
CA | Regal Tang | 2001 | 400
CA | Regal Tang | 2014 | 680
CA | Regal Tang | 2000 | 700
The output would be:
Name | State | Age | Species | Annual Ct
Nemo | NY | 5 | Clownfish | 200
Dora | CA | 2 | Regal Tang | 680
What I've tried:
pets <- data.frame("Name" = c("Nemo", "Dora"), "State" = c("NY", "CA"),
                   "Age" = c(5, 2), "Species" = c("Clownfish", "Regal Tang"))
fishes <- data.frame("State" = c("NY", "NY", "CA", "CA", "CA"),
                     "Species" = c("Clownfish", "Clownfish", "Regal Tang",
                                   "Regal Tang", "Regal Tang"),
                     "Year" = c("2012", "2014", "2001", "2014", "2000"),
                     "AnnualCt" = c("500", "200", "400", "680", "700"))
pets["AnnualCt"] <- NA
for (row in (1:nrow(pets))) {
  pets$AnnualCt[row] <- as.character(droplevels(fishes[which(fishes$State == pets[row, ]$State &
                                                             fishes$Species == pets[row, ]$Species &
                                                             fishes$Year == 2014),
                                                       which(colnames(fishes) == "AnnualCt")]))
}
I'm confused as to what you're trying to do; isn't it just this?
library(dplyr)
left_join(pets, fishes) %>%
  filter(Year == 2014) %>%
  select(-Year)
#Joining, by = c("State", "Species")
# Name State Age Species AnnualCt
#1 Nemo NY 5 Clownfish 200
#2 Dora CA 2 Regal Tang 680
Explanation: left_join both data.frames by State and Species, filter for Year == 2014 and output without Year column.
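If the intent is "the most recent year on record" rather than the literal 2014, a sketch along these lines (my own variation, using slice_max from dplyr >= 1.0, and assuming pets as defined in the question before the column of NAs is added) keeps the latest Year per pet:
# Sketch: keep the most recent Year per pet instead of hard-coding 2014.
# Year is stored as character here, but four-digit years sort correctly as text.
left_join(pets, fishes, by = c("State", "Species")) %>%
  group_by(Name) %>%
  slice_max(Year, n = 1) %>%
  ungroup() %>%
  select(-Year)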

Remove data from DF based on multiple criteria

I have a large data frame (df) that looks something like the following sample. There are a number of data entry errors in the data set and I need to remove these. In the sample data all NSW States should have a Postcode starting with 2. All VIC States should have a Postcode starting with 3.
| Suburb | State | Postcode |
| ------ | ----- | -------- |
| FLEMINGTON | NSW | 2140 |
| FLEMINGTON | NSW | 2144 |
| FLEMINGTON | NSW | 3996 |
| FLEMINGTON | VIC | 2996 |
| FLEMINGTON | VIC | 3021 |
| FLEMINGTON | VIC | 3031 |
I need the final table to look like...
| Suburb | State | Postcode |
| ------ | ----- | -------- |
| FLEMINGTON | NSW | 2140 |
| FLEMINGTON | NSW | 2144 |
| FLEMINGTON | VIC | 3021 |
| FLEMINGTON | VIC | 3031 |
The following solution is kind of close, but I don't know how to filter for integers starting with a specific number and am under time pressure.
Extracting rows from df based on multiple conditions in R
Any help would be greatly appreciated.
To make this easily extendable, do it as a merge operation against only your acceptable values for each state:
merge(
  transform(dat, Pc1 = substr(Postcode, 1, 1)),
  data.frame(State = c("NSW", "VIC"), Pc1 = c("2", "3"))
)
# State Pc1 Suburb Postcode
#1 NSW 2 FLEMINGTON 2140
#2 NSW 2 FLEMINGTON 2144
#3 VIC 3 FLEMINGTON 3021
#4 VIC 3 FLEMINGTON 3031
Try this? If your Postcodes are integers and these are the only conditions, it should be pretty straightforward:
df <- data.frame(Suburb = rep("FLEMINGTON", 6),
                 State = c(rep("NSW", 3), rep("VIC", 3)),
                 Postcode = c(2140, 2144, 3996, 2996, 3021, 3031))
library(dplyr)
df <- df %>%
  filter((State == "NSW" & Postcode < 3000) | (State == "VIC" & Postcode >= 3000))
> df
Suburb State Postcode
1 FLEMINGTON NSW 2140
2 FLEMINGTON NSW 2144
3 FLEMINGTON VIC 3021
4 FLEMINGTON VIC 3031
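Since the question specifically asks how to filter on the first digit, here is a small sketch of that approach (the valid_first lookup vector is my own naming); it compares the leading digit of each Postcode against a per-state table and extends naturally if more states are added:
library(dplyr)

# Sketch: allowed first digit of the Postcode for each State
valid_first <- c(NSW = "2", VIC = "3")

df %>%
  filter(substr(Postcode, 1, 1) == valid_first[as.character(State)])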
