I have the following dataset:
df <- structure(list(Data = structure(c(1623888000, 1629158400, 1629158400
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Client = c("Client1",
"Client1", "Client1"), Fund = c("Fund1", "Fund1", "Fund2"), Nature = c("Application",
"Rescue", "Application"), Quantity = c(433.059697, 0, 171.546757
), Value = c(69800, -70305.67, 24875), `NAV Yesterday` = c(162.40991399996,
162.40991399996, 145.044589000056), `NAV in Application Date` = c(161.178702344125,
162.346370458944, 145.004198476337), `Var NAV` = c(0.00763879866215962,
0.00039140721678275, 0.000278547270652531), `Var * Value` = c(533.188146618741,
-27.5181466187465, 6.92886335748171), FinalValue = c(70333.1881466187,
-70333.1881466187, 24881.9288633575), `Rentability WRONG` = c(0.0210345899274819,
0.0210345899274819, 0.0210345899274819)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
What I need to do is:
If Quantity = 0, then remove all rows with the same Fund name as that row, but only the rows that have Date less than or equal to the Date of the Quantity = 0 row.
What I did here is:
I grouped the data by Fund,
arranged each group by Data,
created a column zero_point that assigns 1 to the row where Quantity == 0 and NA otherwise,
filled the fields in zero_point that come before the actual "zero point" with the same value,
and filtered those rows out.
library(dplyr)
library(tidyr)  # for fill()

output <- df %>%
  group_by(Fund) %>%
  arrange(Data) %>%
  mutate(zero_point = case_when(Quantity == 0 ~ 1)) %>%
  fill(zero_point, .direction = "up") %>%
  filter(is.na(zero_point))
(This assumes there is only one row with Quantity == 0 per Fund group; see the sketch below for handling more than one.)
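If a Fund could have more than one Quantity == 0 row, a minimal sketch of the same idea keyed off the latest zero row (assuming that is the intended cutoff; needs dplyr loaded):
df %>%
  group_by(Fund) %>%
  filter(
    # groups with no zero row are kept whole; na.rm guards against NA in Quantity
    if (!any(Quantity == 0, na.rm = TRUE)) TRUE
    else Data > max(Data[Quantity == 0])
  )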
You can try -
library(dplyr)
df %>%
  filter({
    # Row index where Quantity == 0
    inds = which(Quantity == 0)
    # Drop rows where the Data value is less than or equal to the Data value at
    # Quantity == 0 and the Fund is the same as the Fund at Quantity == 0.
    !(Data <= Data[inds] & Fund %in% Fund[inds])
  })
Here's a thought:
df %>%
  group_by(Fund) %>%
  filter(!any(Quantity == 0) | Data <= Data[which.min(Quantity)])
# # A tibble: 3 x 12
# # Groups: Fund [2]
# Data Client Fund Nature Quantity Value `NAV Yesterday` `NAV in Applica~ `Var NAV` `Var * Value` FinalValue `Rentability WR~
# <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2021-06-17 00:00:00 Clien~ Fund1 Appli~ 433. 69800 162. 161. 0.00764 533. 70333. 0.0210
# 2 2021-08-17 00:00:00 Clien~ Fund1 Rescue 0 -70306. 162. 162. 0.000391 -27.5 -70333. 0.0210
# 3 2021-08-17 00:00:00 Clien~ Fund2 Appli~ 172. 24875 145. 145. 0.000279 6.93 24882. 0.0210
I'm assuming you meant "Data <= Data of the Quantity = 0 Fund", so I'm using Data instead of Date (not found) or NAV in Application Date.
This filters nothing in this sample data; I'm hoping the logic is correct.
Testing for equality with floating-point (numeric) values can be problematic (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). If you have some small near-zero numbers, this will silently produce counter-intuitive results without warning or error. You might be more defensive and use something like:
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) | Data <= Data[which.min(Quantity)])
or even
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) |
           row_number() == which.min(Quantity) |
           Data < Data[which.min(Quantity)])
While the latter is a bit paranoid (and double-calculates which.min(.)), it should not succumb to problems with equality tests.
The only time this will fail is if all(is.na(Quantity)); that is, which.min(c(NA, NA)) returns integer(0), which will cause an error in dplyr::filter. One might choose to add a safeguard with something like filter(any(!is.na(Quantity)) & (...)), sketched below.
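A minimal sketch of wiring that safeguard in, assuming all-NA groups should be kept whole (return FALSE from the first branch to drop them instead):
df %>%
  group_by(Fund) %>%
  filter(
    # all-NA groups would make which.min() return integer(0), so handle them first
    if (all(is.na(Quantity))) TRUE
    else all(abs(Quantity) > 0, na.rm = TRUE) |
      Data <= Data[which.min(Quantity)]
  )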
I've got a dataset that looks like this
df = data.frame(
  Site = c(rep('w', 4), rep('x', 5), rep('y', 2), 'z'),
  Parent = c(rep('W Inc.', 4), rep('X Inc.', 5), rep('Y Inc.', 2), 'Z Inc.'),
  Status = c(rep('Prospect', 4), rep('Client', 5), rep('Client', 2), 'Prospect'),
  Country = c(rep('USA', 4), rep('Canada', 5), rep('Mexico', 2), 'China'),
  ProductID = c('XP10', 'XP11', 'XP18', 'XP19', 'XP4', 'XP5', 'XP6', 'XP7', 'XP8', 'XP10', 'XP18', 'XP6'),
  ProductName = c('10Rockets', '11Rockets', '18Rockets', '19Rockets', '4Rockets', '5Rockets',
                  '6Rockets', '7Rockets', '8Rockets', '10Rockets', '18Rockets', '6Rockets'),
  ProductProvider = c('Provider A', 'Provider B', 'Provider A', 'Provider A',
                      rep('Provider A', 5), 'Provider A', 'Provider B', 'Provider B'))
I'd like to condense it such that each Site is a unique row, and the last 3 columns are concatenated.
Also, I'd like to concatenate the last column such that if there are any repetitions, it takes only the unique values per Site and separates them with commas.
My attempt
library(dplyr)
output2 = df %>%
  group_by(Site, Parent, Status, Country) %>%
  mutate(ProductID = paste(ProductID, collapse = ",")) %>%
  mutate(ProductName = paste(ProductName, collapse = ",")) %>%
  mutate(ProductProvider = unique(paste(ProductProvider, collapse = ","))) %>%
  distinct()
I'm almost there, but the last column still has repetitions of ProductProvider, which I do not want.
Target Output
I'm looking for a target data set like this, with the last column concatenated and free of any repetitions. Any inputs would be appreciated.
output = data.frame(
  Site = c('w', 'x', 'y', 'z'),
  Parent = c('W Inc.', 'X Inc.', 'Y Inc.', 'Z Inc.'),
  Status = c('Prospect', 'Client', 'Client', 'Prospect'),
  Country = c('USA', 'Canada', 'Mexico', 'China'),
  ProductID = c('XP10,XP11,XP18,XP19', 'XP4,XP5,XP6,XP7,XP8', 'XP10,XP18', 'XP6'),
  ProductName = c('10Rockets,11Rockets,18Rockets,19Rockets',
                  '4Rockets,5Rockets,6Rockets,7Rockets,8Rockets',
                  '10Rockets,18Rockets', '6Rockets'),
  ProductProvider = c('Provider A,Provider B', 'Provider A', 'Provider A,Provider B', 'Provider B'))
With dplyr:
library(dplyr)
result = df %>%
  group_by(Site, Parent, Status, Country) %>%
  summarize(across(ProductProvider, ~ paste(unique(.), collapse = ", ")),
            across(everything(), ~ paste(., collapse = ", ")))
result
# # A tibble: 4 x 7
# # Groups: Site, Parent, Status [4]
# Site Parent Status Country ProductProvider ProductID ProductName
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 w W Inc. Prospect USA Provider A, Provider B XP10, XP11, XP18, XP19 10Rockets, 11Rockets, 18Rockets, 19Rockets
# 2 x X Inc. Client Canada Provider A XP4, XP5, XP6, XP7, XP8 4Rockets, 5Rockets, 6Rockets, 7Rockets, 8Rockets
# 3 y Y Inc. Client Mexico Provider A, Provider B XP10, XP18 10Rockets, 18Rockets
# 4 z Z Inc. Prospect China Provider B XP6 6Rockets
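As to why your mutate() attempt kept the repetitions: unique(paste(ProductProvider, collapse = ",")) collapses first and then calls unique() on a single string, which is a no-op. Deduplicate before collapsing instead; a minimal sketch of that fix to your attempt:
output2 <- df %>%
  group_by(Site, Parent, Status, Country) %>%
  mutate(ProductID = paste(ProductID, collapse = ","),
         ProductName = paste(ProductName, collapse = ","),
         # unique() first, then collapse
         ProductProvider = paste(unique(ProductProvider), collapse = ",")) %>%
  distinct()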
Short and sweet with aggregate.
aggregate(. ~ Site, df, unique)
# Site Parent Status Country ProductID ProductName ProductProvider
# 1 w W Inc. Prospect USA XP10, XP11, XP18, XP19 10Rockets, 11Rockets, 18Rockets, 19Rockets Provider A, Provider B
# 2 x X Inc. Client Canada XP4, XP5, XP6, XP7, XP8 4Rockets, 5Rockets, 6Rockets, 7Rockets, 8Rockets Provider A
# 3 y Y Inc. Client Mexico XP10, XP18 10Rockets, 18Rockets Provider A, Provider B
# 4 z Z Inc. Prospect China XP6 6Rockets Provider B
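One caveat: unique() can leave list columns behind when the number of unique values differs between groups. If you need plain character columns, toString() is a drop-in replacement (a sketch):
# collapse the unique values per Site into a single comma-separated string
aggregate(. ~ Site, df, function(x) toString(unique(x)))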
I have the following problem: I have a data frame/tibble that has (a lot of) columns representing a value in different years, e.g. the number of inhabitants in a city at different points in time. I now want to generate columns that give me the growth rate (see the minimal example below). It should be something like using mutate() while looping over the columns. I think this should be a common task, but I can't find any hint on how to do it.
Edit:
A minimal example could look like this:
## Minimal example
library(tidyverse)
## Given data frame
df <- tibble(
  City = c("Melbourne", "Sydney", "Adelaide"),
  year_2000 = c(100, 100, 205),
  year_2001 = c(101, 100, 207),
  year_2002 = c(102, 100, 209)
)
## Result
df <- df %>%
  mutate(
    gr_2000_2001 = year_2001 / year_2000 * 100 - 100,
    gr_2001_2002 = year_2002 / year_2001 * 100 - 100
  )
I want to find a way to automate/do the mutate command in a smart way, as I have to do it for 150 years.
The easiest way in this example would probably be to make your data tidy and then apply whatever formula you are using to calculate growth rates, using dplyr's lag() function on a data frame grouped by City:
## Minimal example
library(tidyverse)
df <- data.frame(City = c("Melbourne", "Sydney"),
                 year_2000 = c(100, 100),
                 year_2001 = c(101, 100),
                 year_2002 = c(102, 102))
df %>%
  gather(year, value, 2:4) %>%
  group_by(City) %>%
  mutate(growth = value / dplyr::lag(value, n = 1))
The result is this:
# A tibble: 6 x 4
# Groups: City [2]
City year value growth
<fct> <chr> <dbl> <dbl>
1 Melbourne year_2000 100 NA
2 Sydney year_2000 100 NA
3 Melbourne year_2001 101 1.01
4 Sydney year_2001 100 1
5 Melbourne year_2002 102 1.01
6 Sydney year_2002 102 1.02
If you absolutely need the data back in the original wide format, you can then apply spread() to reshape it; this is generally not recommended, however. A sketch follows.
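A minimal sketch of that reshape, assuming the growth columns may keep the year_ prefix (rename them afterwards if you want names like gr_2000_2001); the first year comes out as NA:
df %>%
  gather(year, value, -City) %>%
  group_by(City) %>%
  mutate(growth = value / dplyr::lag(value) * 100 - 100) %>%  # percent growth, as in the question
  select(-value) %>%
  spread(year, growth)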
I need help with some data manipulation in R.
My dataset looks something like this:
Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Now I want to identify the number of changes that "Smith" has gone through (two), based on the combination of Name and country only.
Basically, I want to compare each data point with the other data points and identify the count of changes there have been in the entire dataset for the combination of Name and Country only.
You can do this by comparing the lag values of the combination with its current value by group using dplyr:
library(dplyr)
df %>%
  group_by(Name) %>%
  mutate(combination = paste(country, age),
         lag_combination = lag(combination, default = "", order_by = Name),
         Changes = cumsum(combination != lag_combination)) %>%
  slice(n()) %>%
  select(Name, Changes)
Result:
# A tibble: 3 x 2
# Groups: Name [3]
Name Changes
<fctr> <int>
1 Avin 2
2 Robin 1
3 Smith 3
Notes:
Make sure lag here is dplyr::lag (and not stats::lag); order_by = Name makes the ordering of the lag explicit.
I'm setting default = "" so that the first entry of each group does not compare against NA (a numeric default like 0 would fail the type check on a character column in current dplyr).
Data:
df = read.table(text="Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Smith, Canada, 27
Robin, France, 28
Avin, France, 26", header = TRUE, sep = ',')
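For what it's worth, a more compact equivalent is to sum the change indicator directly instead of cumsum() + slice() (a sketch, same assumptions as above):
df %>%
  group_by(Name) %>%
  summarise(Changes = sum(paste(country, age) != lag(paste(country, age), default = "")))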
This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Select the row with the maximum value in each group
(19 answers)
Closed 5 years ago.
I have two data frames: City and Country. I am trying to find out the most popular city per country. City and Country have a common field, City.CountryCode and Country.Code. These two data frames were merged into one called CityCountry. I have tried the aggregate command like so:
aggregate(Population.x ~ CountryCode, CityCountry, max)
This aggregate command only shows the CountryCode and Population.x columns. How would I show the name of the Country and the name of the City? Is aggregate the wrong command to use here?
Could also use dplyr to group by Country, then filter by max(Population.x).
library(dplyr)
set.seed(123)
CityCountry <- data.frame(Population.x = sample(1000:2000, 10, replace = TRUE),
                          CountryCode = rep(LETTERS[1:5], 2),
                          Country = rep(letters[1:5], 2),
                          City = letters[11:20],
                          stringsAsFactors = FALSE)
CityCountry %>%
  group_by(Country) %>%
  filter(Population.x == max(Population.x)) %>%
  ungroup()
# A tibble: 5 x 4
Population.x CountryCode Country City
<int> <chr> <chr> <chr>
1 1287 A a k
2 1789 B b l
3 1883 D d n
4 1941 E e o
5 1893 C c r