Conditional subtraction in a data frame in R

I have the following data frame.
ID Year
A 2001
A 2002
A 2003
B 2009
B 2010
I would like to create a third column in which I subtract the minimum year of the corresponding ID from the year and then add one.
In short, I would like to have this:
ID Year New
A 2001 1
A 2002 2
A 2003 3
B 2009 1
B 2010 2
I am pretty new to R and dplyr and haven't found a way to do that without a loop.
Thank you in advance

In dplyr you need to use group_by and mutate like so:
df <- read.table(text = "ID Year
A 2001
A 2002
A 2003
B 2009
B 2010", header = TRUE)
library(dplyr)
df <- df %>%
group_by(ID) %>%
mutate(New = Year - min(Year) + 1)
df
# ID Year New
# A 2001 1
# A 2002 2
# A 2003 3
# B 2009 1
# B 2010 2

Using the tidyverse:
library(tidyverse)
data <- tribble(~ID, ~year,
"A", 2001,
"A", 2002,
"A", 2003,
"B", 2009,
"B", 2010
)
data %>% group_by(ID) %>%
mutate(new = year - min(year)+1)

Using ddply:
library(plyr)
df<-data.frame(ID=c("A","A","A","B","B"), Year=c(2001,2002,2003,2009,2010))
ddply(df, .(ID), transform, New=Year-min(Year)+1)
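For comparison, the same per-ID calculation can be sketched in base R without any packages. This is only an illustration of the idea in the answers above: ave() returns the group minimum repeated for every row of the group, so the subtraction stays vectorised.

```r
# Same sample data as in the question
df <- data.frame(ID = c("A", "A", "A", "B", "B"),
                 Year = c(2001, 2002, 2003, 2009, 2010))

# ave() applies min() within each ID and returns one value per row
df$New <- df$Year - ave(df$Year, df$ID, FUN = min) + 1
df
```

Because ave() keeps the original row order, the result lines up with the data frame without any sorting or merging.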

Related

Overwrite a specific value in a dataframe, based on matching values

My data is in a format like this:
#> country year value
#> 1 AUS 2019 100
#> 2 USA 2019 120
#> 3 AUS 2018 90
df <- data.frame(stringsAsFactors=FALSE,
country = c("AUS", "USA", "AUS"),
year = c(2019, 2019, 2018),
value = c(100, 120, 90)
)
and I have a one-row data frame that represents a revision that should overwrite the existing record in my data.
#> country year value
#> 1 AUS 2018 500
df2 <- data.frame(stringsAsFactors=FALSE,
country = c("AUS"),
year = c(2018),
value = c(500)
)
My desired output is:
#> country year value
#> 1 AUS 2019 100
#> 2 USA 2019 120
#> 3 AUS 2018 500
I know how to find the row to overwrite:
library(tidyverse)
df %>% filter(country == df2$country & year == df2$year) %>%
mutate(value = df2$value)
but how do I put that back in the original dataframe?
Tidyverse answers are easier for me to work with, but I'm open to any solutions.
Using mutate and if_else:
library(tidyverse)
df %>%
mutate(value = if_else(country %in% df2$country & year %in% df2$year, df2$value, value))
Results in:
country year value
1 AUS 2019 100
2 USA 2019 120
3 AUS 2018 500
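Note that the %in% test above checks country and year separately, which is fine for a single revision row but could match unintended combinations once df2 has several rows. A minimal base R sketch that matches on the country/year pair instead is shown here; the helper names key, key2, idx and hit are made up for this illustration.

```r
df <- data.frame(country = c("AUS", "USA", "AUS"),
                 year = c(2019, 2019, 2018),
                 value = c(100, 120, 90),
                 stringsAsFactors = FALSE)
df2 <- data.frame(country = "AUS", year = 2018, value = 500,
                  stringsAsFactors = FALSE)

# Build one key per row from the country/year pair
key  <- paste(df$country, df$year)
key2 <- paste(df2$country, df2$year)

# idx is the matching row of df2, or NA where no revision exists
idx <- match(key, key2)
hit <- !is.na(idx)

# Overwrite only the rows that have a revision
df$value[hit] <- df2$value[idx[hit]]
df
```

Rows without a matching revision keep their original value, and the row order of df is unchanged.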
An efficient approach is an update join with data.table. Convert the data.frame to a data.table by reference with setDT(df), join with df2 on country and year, and assign (:=) the value column from the second dataset (i.value) to replace value in the original dataset:
library(data.table)
setDT(df)[df2, value := i.value, on = .(country, year)]
df
# country year value
#1: AUS 2019 100
#2: USA 2019 120
#3: AUS 2018 500
One possible tidyverse approach uses (1) anti_join to remove the rows from df that will be replaced and (2) bind_rows to add the replacement rows from df2:
library(dplyr)
anti_join(df, df2, by = c("country", "year")) %>% bind_rows(df2)
#> country year value
#> 1 AUS 2019 100
#> 2 USA 2019 120
#> 3 AUS 2018 500
Or another one, using (1) right_join to join the old and new values and (2) coalesce to prefer the new values where they exist:
right_join(df2, df, by = c("country", "year")) %>%
transmute(country, year, value = coalesce(value.x, value.y))
#> country year value
#> 1 AUS 2019 100
#> 2 USA 2019 120
#> 3 AUS 2018 500

Provide a list of names with consecutive values in another column

I have a large dataset with company names and years:
2001 company 1
2002 company 1
2003 company 1
2004 company 1
2001 company 2
2002 company 2
2001 company 3
2003 company 3
2004 company 3
I need to write a function that will, given years n and m, provide me a list of companies which have corresponding consecutive year values beginning at year n and ending at year m.
For example, in the above case, f(2001, 2002) would show:
2001 company 1
2002 company 1
2001 company 2
2002 company 2
It could also provide only the company names. f(2001, 2003) would show only companies 1 and 2 since company 3 skips 2002.
You can also wrap a few dplyr functions into your function to get the desired results.
library(dplyr)
company_func <- function(data, year_1, year_2){
#filter dataset to years of interest
data <- data %>% filter(Year >= year_1 & Year <= year_2)
#sort by company and year
data <- data %>% arrange(Company, Year)
#calc difference in years for each company
#(the pipe must end the line; a line starting with %>% is a syntax error)
data <- data %>% group_by(Company) %>%
mutate(year_diff = Year - lag(Year, default = min(Year)))
#filter to only comp with consecutive years
data.filter <- data %>% filter(year_diff == 1)
data <- data %>% filter(Company %in% data.filter$Company) %>%
select(Company, Year)
return(data)
}
The results:
company_func(data, 2001, 2002)
Company Year
1 company 1 2001
2 company 1 2002
3 company 2 2001
4 company 2 2002
company_func(data, 2001, 2003)
Company Year
1 company 1 2001
2 company 1 2002
3 company 1 2003
4 company 2 2001
5 company 2 2002
Try this:
# year1: start year
# year2: end year
# df: the data frame that holds these values
companies_func <- function(year1, year2, df)
{
# the trailing comma selects rows; without it a plain data.frame would be indexed by column
return (df[(df$year >= year1) & (df$year <= year2), ])
}
print(companies_func(2001, 2002, df))
year company
1: 2001 company1
2: 2002 company1
3: 2001 company2
4: 2002 company2
5: 2001 company3
I would use the data.table package instead of a function:
library(data.table)
years <- c(2001, 2002) # vector with your years
dt <- as.data.table(df) # convert the table to a data.table
dt[year %in% years]
EDIT:
I misread your problem. If you want a range of years I would do it like this:
years <- 2001:2003 # vector with your years, from start to end point
dt <- as.data.table(df) #convert the table to a data.table
dt[year %in% years]
Here is a solution with data.table:
library("data.table")
dt <- fread(
"year company
2001 company1
2002 company1
2003 company1
2004 company1
2001 company2
2002 company2
2001 company3
2003 company3
2004 company3")
years <- 2001:2002
dt[, if (all(years %in% year)) company, company][,1]
# dt[, if (all(years %in% year)) company, company][, company] # if you want a vector of char
This will give you the names of the companies which have the full sequence of years:
# > dt[, if (all(years %in% year)) company, company][,1]
# company
# 1: company1
# 2: company2
If you want to define a function, you can do:
f <- function(DT, from, to) {
years <- from:to
DT[, if (all(years %in% year)) company, company][,1]
}
f(dt, 2001, 2002)
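The same "all years present" test can also be sketched in base R, in case data.table is not available. The function name companies_with_run and the column names year and company are assumptions for this illustration, following the sample data above.

```r
df <- data.frame(
  year = c(2001, 2002, 2003, 2004, 2001, 2002, 2001, 2003, 2004),
  company = c(rep("company1", 4), rep("company2", 2), rep("company3", 3)),
  stringsAsFactors = FALSE
)

# Return the companies that have every year from `from` to `to`
companies_with_run <- function(df, from, to) {
  years <- from:to
  # One logical per company: are all requested years present?
  ok <- vapply(split(df$year, df$company),
               function(y) all(years %in% y),
               logical(1))
  names(ok)[ok]
}

companies_with_run(df, 2001, 2002)
companies_with_run(df, 2001, 2003)
```

With this stricter reading, a company qualifies only if it covers the whole range, so company2 (which stops at 2002) drops out for the range 2001 to 2003.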

R: creating new variable with conditions using dplyr

Hi I am trying to create a new variable with dplyr.
My data looks like the following:
Land happy year
<fctr> <int> <dbl>
1 Country1 09 2002
2 Country1 08 2012
3 Country3 05 2008
...
To create a variable with the mean of happy per Land and year, I used this code:
New <-df %>%
group_by(Land, year) %>%
mutate(mean.happy = mean(happy, na.rm=T))
Now I would like to make a variable with this content:
(mean of happy in 2012)- (mean of happy in 2008) for each Country.
How can I build a new variable with these conditions?
Here's a dplyr/tidyr solution.
library(dplyr)
library(tidyr)
df <- df %>%
group_by(Land, year) %>%
mutate(mean.happy = mean(happy, na.rm=T)) %>%
spread(year, mean.happy)
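Here's a data.table solution, which is typically faster. Note that the second assignment has no by argument, so diff.happiness is computed across the whole table; add by = Land to get one difference per country.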
library(data.table)
dt = read.table("clipboard", header = TRUE)
setDT(dt)
dt[ , "mean.happy" := mean(happy), by = .(Land, year)]
dt[ , "diff.happiness" := mean(happy[year == 2012]) - mean(happy[year == 2008])]
> dt
Land happy year mean.happy diff.happiness
1: Country1 9 2002 9 3
2: Country1 8 2012 8 3
3: Country3 5 2008 5 3
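To get one difference per country, the calculation has to be grouped by Land. The following base R sketch shows one way this might look; the sample values are made up for the illustration (they are not the asker's data), and diff.happy and diff_by_land are hypothetical names.

```r
# Hypothetical sample data with both years present for each country
df <- data.frame(Land = c("Country1", "Country1", "Country1", "Country3", "Country3"),
                 happy = c(9, 8, 6, 5, 9),
                 year = c(2002, 2012, 2008, 2008, 2012))

# For each country: mean of happy in 2012 minus mean of happy in 2008
diff_by_land <- sapply(split(df, df$Land), function(d) {
  mean(d$happy[d$year == 2012], na.rm = TRUE) -
    mean(d$happy[d$year == 2008], na.rm = TRUE)
})

# Look the per-country value up for every row
df$diff.happy <- unname(diff_by_land[as.character(df$Land)])
df
```

A country that lacks one of the two years gets NaN here, since the mean of an empty vector is NaN.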

R Add rows while reshaping a data frame [duplicate]

This question already has answers here:
Expand ranges defined by "from" and "to" columns
I have a data frame similar to df that looks like a registry of entries and exits in a system.
df = data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))
> df
id entry exit
1 A 2011 2013
2 B 2014 2015
My aim is to represent my df in long format. gather() from tidyr lets me do something like this.
df_long = df %>% gather(registry, time, entry:exit) %>% arrange(id)
> df_long
id registry time
1 A entry 2011
2 A exit 2013
3 B entry 2014
4 B exit 2015
Yet, I am stuck on how I could incorporate additional rows that would represent the time that my observations (id) are effectively in the system. My desired data.frame then would look something like this:
id time
1 A 2011
2 A 2012
3 A 2013
4 B 2014
5 B 2015
Any idea of how I could do this is more than welcome and really appreciated.
Here's a way to get toward your desired solution:
df1 <- data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))
setNames(stack(by(df1, df1$id, function(x) x$entry : x$exit))[,c(2,1)],
c('id','time'))
id time
1 A 2011
2 A 2012
3 A 2013
4 B 2014
5 B 2015
UPDATE: Another solution based on plyr incorporating the comment above could be:
df1 <- data.frame(id = c("A", "B"), region = c("country.1", "country.2"), entry = c(2011, 2014), exit = c(2013, 2015))
library(plyr)
ddply(df1, .(id,region), summarize, time=seq(entry, exit))
That yields:
id region time
1 A country.1 2011
2 A country.1 2012
3 A country.1 2013
4 B country.2 2014
5 B country.2 2015
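Another base R sketch of the same expansion builds one sequence per row and repeats each id to match. It is only an illustration; years and df_long are names chosen for this example.

```r
df <- data.frame(id = c("A", "B"), entry = c(2011, 2014), exit = c(2013, 2015))

# One sequence of years per row, from entry to exit
years <- Map(seq, df$entry, df$exit)

# Repeat each id once per year in its sequence, then flatten the years
df_long <- data.frame(id = rep(df$id, lengths(years)),
                      time = unlist(years))
df_long
```

Further columns such as region could be carried along the same way, by repeating them with rep(..., lengths(years)).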

How to get the maximum value by group

I have a data.frame with two columns: year and score. The years go from 2000-2012 and each year can be listed multiple times. In the score column I list all the scores for each year with each row having a different score.
What I'd like to do is filter the data.frame so only the rows with the maximum scores for each year remain.
So as a tiny example if I have
year score
2000 18
2001 22
2000 21
I would want to return just
year score
2001 22
2000 21
If you know sql this is easier to understand
library(sqldf)
sqldf('select year, max(score) from mydata group by year')
Update (2016-01): Now you can also use dplyr
library(dplyr)
mydata %>% group_by(year) %>% summarise(max = max(score))
using plyr
require(plyr)
set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=T), score=sample(25))
ddply(df, .(year), summarise, max.score=max(score))
using data.table
require(data.table)
dt <- data.table(df, key="year")
dt[, list(max.score=max(score)), by=year]
using aggregate:
o <- aggregate(df$score, list(df$year) , max)
names(o) <- c("year", "max.score")
using ave:
df1 <- df
df1$max.score <- ave(df1$score, df1$year, FUN=max)
df1 <- df1[!duplicated(df1$year), ]
Edit: In case of more columns, a data.table solution would be the best (my opinion :))
set.seed(45)
df <- data.frame(year=sample(2000:2012, 25, replace=T), score=sample(25),
alpha = sample(letters[1:5], 25, replace=T), beta=rnorm(25))
# convert to data.table with key=year
dt <- data.table(df, key="year")
# get the subset of data that matches this criterion
dt[, .SD[score %in% max(score)], by=year]
# year score alpha beta
# 1: 2000 20 b 0.8675148
# 2: 2001 21 e 1.5543102
# 3: 2002 22 c 0.6676305
# 4: 2003 18 a -0.9953758
# 5: 2004 23 d 2.1829996
# 6: 2005 25 b -0.9454914
# 7: 2007 17 e 0.7158021
# 8: 2008 12 e 0.6501763
# 9: 2011 24 a 0.7201334
# 10: 2012 19 d 1.2493954
using base packages
> df
year score
1 2000 18
2 2001 22
3 2000 21
> aggregate(score ~ year, data=df, max)
year score
1 2000 21
2 2001 22
EDIT
If you have additional columns that you need to keep, you can use merge with aggregate to get those columns:
> df <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21) , hrs = c( 10, 11, 12))
> df
year score hrs
1 2000 18 10
2 2001 22 11
3 2000 21 12
> merge(aggregate(score ~ year, data=df, max), df, all.x=T)
year score hrs
1 2000 21 12
2 2001 22 11
data <- data.frame(year = c(2000, 2001, 2000), score = c(18, 22, 21))
new.year <- unique(data$year)
new.score <- sapply(new.year, function(y) max(data[data$year == y, ]$score))
data <- data.frame(year = new.year, score = new.score)
A one-liner:
df_2 <- data.frame(year = sort(unique(df$year)), score = tapply(df$score, df$year, max))
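When the goal is to filter the original rows rather than summarise them, a base R sketch with ave() keeps every column. It assumes the small example with the extra hrs column used earlier; top is a name chosen for this illustration.

```r
df <- data.frame(year = c(2000, 2001, 2000),
                 score = c(18, 22, 21),
                 hrs = c(10, 11, 12))

# Keep the rows whose score equals the maximum score of their year
top <- df[df$score == ave(df$score, df$year, FUN = max), ]
top
```

If several rows tie for the maximum within a year, all of them are kept.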
