Create a local id for a combination of 2 columns [duplicate] - r

This question already has answers here:
R - add column that counts sequentially within groups but repeats for duplicates
(3 answers)
Closed 7 years ago.
I have a dataset I wish to process, and instead of processing it as a time series, I want to summarize the time behaviour. Here is the dataset:
business_id year
vcNAWiLM4dR7D2nwwJ7nCA 2007
vcNAWiLM4dR7D2nwwJ7nCA 2007
vcNAWiLM4dR7D2nwwJ7nCA 2009
UsFtqoBl7naz8AVUBZMjQQ 2004
UsFtqoBl7naz8AVUBZMjQQ 2005
cE27W9VPgO88Qxe4ol6y_g 2007
cE27W9VPgO88Qxe4ol6y_g 2007
cE27W9VPgO88Qxe4ol6y_g 2008
cE27W9VPgO88Qxe4ol6y_g 2010
I want to turn it into this:
business_id year yr_id
vcNAWiLM4dR7D2nwwJ7nCA 2007 1
vcNAWiLM4dR7D2nwwJ7nCA 2007 1
vcNAWiLM4dR7D2nwwJ7nCA 2009 2
UsFtqoBl7naz8AVUBZMjQQ 2004 1
UsFtqoBl7naz8AVUBZMjQQ 2005 2
cE27W9VPgO88Qxe4ol6y_g 2007 1
cE27W9VPgO88Qxe4ol6y_g 2007 1
cE27W9VPgO88Qxe4ol6y_g 2008 2
cE27W9VPgO88Qxe4ol6y_g 2010 3
In other words, I want the ID to be sequential to the year, but local to the business_id, so that it resets when the program finds another business_id.
Is this something that is easily achievable in R?

I found another question on SO whose answer also solves this one, so this should be marked as a duplicate.
https://stackoverflow.com/a/27896841/4858065
The way to achieve this is:
library(dplyr)
df %>%
  group_by(business_id) %>%
  mutate(yr_id = dense_rank(year))
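For readers who are not using dplyr, the same dense ranking within each business_id can be sketched in base R. This is only a minimal illustration built on the sample data from the question; the use of ave() with match() and sort(unique()) is one possible approach, not the method from the linked answer.

```r
# Sample data from the question
df <- data.frame(
  business_id = c(rep("vcNAWiLM4dR7D2nwwJ7nCA", 3),
                  rep("UsFtqoBl7naz8AVUBZMjQQ", 2),
                  rep("cE27W9VPgO88Qxe4ol6y_g", 4)),
  year = c(2007, 2007, 2009, 2004, 2005, 2007, 2007, 2008, 2010),
  stringsAsFactors = FALSE
)

# Within each business_id, map every year to its position among the
# sorted distinct years; ties therefore share the same id.
df$yr_id <- ave(df$year, df$business_id,
                FUN = function(y) match(y, sort(unique(y))))

print(df)
```

Because the id is computed from the sorted distinct years, the rows do not need to be ordered by year beforehand.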


How can I arrange a group within a data frame based on year?

I have a data frame ("df") which I want to order based on year for a specific group based on Ticker.
year Ticker at
2009 FLWS 286.127
2003 FLWS 214.796
2007 FLWS 352.507
2008 FLWS 371.338
2004 FLWS 261.552
2005 FLWS 251.952
2010 FLWS 256.086
2011 FLWS 256.951
2006 FLWS 346.634
2007 SRCE 4447.104
2009 SRCE 4542.100
2003 SRCE 3330.153
2010 SRCE 4445.281
2011 SRCE 4374.071
2005 SRCE 3511.277
I want to have the data frame in order of year (ascending) for each group of Ticker. I've tried using base R (order) and the dplyr package (group_by, arrange) but I am a complete newbie to any sort of coding so needless to say I have been struggling.
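One way this could be done in base R is with order() on both columns: Ticker first, so the groups stay together, then year. The sketch below uses only a few rows of the data above, and the variable name sorted is just for illustration.

```r
# A few rows from the question's data frame
df <- data.frame(
  year   = c(2009, 2003, 2007, 2007, 2009, 2003),
  Ticker = c("FLWS", "FLWS", "FLWS", "SRCE", "SRCE", "SRCE"),
  at     = c(286.127, 214.796, 352.507, 4447.104, 4542.100, 3330.153),
  stringsAsFactors = FALSE
)

# order() sorts by its first argument and breaks ties with the second,
# so rows are grouped by Ticker and ascending by year within each group.
sorted <- df[order(df$Ticker, df$year), ]
rownames(sorted) <- NULL  # drop the old row labels

print(sorted)
```

With dplyr loaded, arrange(df, Ticker, year) should give the same ordering; group_by() is not needed just to sort.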

Functions on a Matrix in R

Let's say I have a dataset with a column representing the years.
Years
2007
2008
2009
2011
2015
I want to subtract each row from the row below it and save the answer in a new column. For the data above, I want a function that subtracts 2007 from 2008 (the answer is 1) and saves it to a new column; the next would be 2009 - 2008, then 2011 - 2009. The resulting matrix should look like:
Year Gap
2007 1
2008 1
2009 2
2011 4
2015 .
and so on
How can I make a function in R that will do this for me?
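A minimal sketch of one way to get this, assuming the years are held in a plain numeric vector: diff() returns the differences between consecutive elements, and padding with NA fills the last row, which has no row below it. The names years, gap and result are illustrative only.

```r
years <- c(2007, 2008, 2009, 2011, 2015)

# diff() gives years[i + 1] - years[i]; the last year has no successor,
# so pad with NA to keep the column the same length as the input.
gap <- c(diff(years), NA)

result <- data.frame(Year = years, Gap = gap)
print(result)
```

For a data frame column, the same idea would be df$Gap <- c(diff(df$Years), NA).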

return the years for which the values are NA for specific countries in r [closed]

I'm new to R and couldn't find a solution to this.
I have a data set with country codes, values, and years (panel data).
The 'Value' column has many NAs.
For each country, I would like to get a list of the years for which the value is NA.
Would this be possible using dplyr?
Making the test case:
df <- read.table(text="Country Year Value
UKR 2006 NA
UKR 2007 NA
UKR 2008 2000
ARE 2006 NA
ARE 2007 NA", header=TRUE)
For each country, get a list of years for which the values are NA:
lapply(split(df, df["Country"]), function(x) x$Year[is.na(x$Value)])
# or equivalent but more readable
with(subset(df, is.na(Value)), split(Year, Country))
Output:
$ARE
[1] 2006 2007
$UKR
[1] 2006 2007
Is this what you need?
Use the which function on the result of is.na (note that the column is Value, with a capital V):
df[which(is.na(df$Value)), ]
Do you mean like this?
DAT = read.table(text="Country.Code Year Value
UKR 2006 NA
UKR 2007 NA
UKR 2008 2000
ARE 2006 NA
ARE 2007 NA",
header=TRUE)
DAT[is.na(DAT$Value), 1:2]
Country.Code Year
1 UKR 2006
2 UKR 2007
4 ARE 2006
5 ARE 2007
Addition
To get all years for one country in a single line, you could use
temp = DAT[is.na(DAT$Value), 1:2]
aggregate(temp$Year, list(temp$Country.Code), paste, collapse=",")
Group.1 x
1 ARE 2006,2007
2 UKR 2006,2007

R counter, counting frequency in a table [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Add column with order counts
(2 answers)
Closed 6 years ago.
I have the following data set
id year
2 20332 2005
3 6383 2005
14 20332 2006
15 6806 2006
16 23100 2006
I would like to have an additional column that counts how many years each id has appeared so far:
id year Counter
2 20332 2005 1
3 6383 2005 1
14 20332 2006 2
15 6806 2006 1
16 23100 2006 1
The dataset is currently not sorted according to the year. I thought about mutate rather than a function.
Any ideas? Thanks!
We can use ave from base R
df1$Counter <- with(df1, ave(id, id, FUN = seq_along))
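Note that seq_along numbers the rows of each id in the order they appear, so it matches the year order only if the data are already sorted by year. Since the question says the data are not sorted, here is a hedged base R variant that ranks the years within each id instead. The shuffled row order below is an assumption made for the illustration.

```r
# Same ids and years as in the question, deliberately out of year order
df1 <- data.frame(
  id   = c(20332, 6383, 20332, 6806, 23100),
  year = c(2006, 2005, 2005, 2006, 2006)
)

# Rank the years within each id; ties.method = "first" keeps the counter
# a plain 1, 2, 3, ... sequence even if an id has a repeated year.
df1$Counter <- ave(df1$year, df1$id,
                   FUN = function(y) rank(y, ties.method = "first"))

print(df1)
```

Here the first row (id 20332, year 2006) gets Counter 2 because the same id also appears in 2005, even though that row comes later.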

Simple filtering in R, but with more than one value

I am well aware of how to extract some data based on a condition, but whenever I try multiple conditions, a struggle ensues. I have some data and I only want to extract certain years from the df. Here is an example df:
year value
2006 3
2007 4
2007 3
2008 5
2008 4
2008 4
2009 5
2009 9
2010 2
2010 8
2011 3
2011 8
2011 7
2012 3
2013 4
2012 6
Now let's say I just want 2008, 2009, 2010, and 2011. I try
df<-df[df$year == c("2008", "2009", "2010", "2011"),]
doesn't work, so then:
df<-df[df$year == "2008" & df$year == "2009"
& df$year == "2010" & df$year == "2011",]
No error messages, just an empty df. What am I missing?
You need to use %in% and not ==
df[df$year %in% c(2008, 2009, 2010, 2011),]
year value
4 2008 5
5 2008 4
6 2008 4
7 2009 5
8 2009 9
9 2010 2
10 2010 8
11 2011 3
12 2011 8
13 2011 7
As already answered, %in% works, but so should |. The & operator is AND logic, meaning the year would need to equal 2008, 2009, 2010 AND 2011 all at once, whereas what you want is the OR operator.
df<-df[df$year == "2008" | df$year == "2009" | df$year == "2010" | df$year == "2011",]
If you don't like %in%, try the function is.element. You might find it more intuitive.
df[is.element(el=df[,"year"], set=c(2008:2011)),]
Careful, though... switching el and set gives different results, and it can be confusing which way you want it. For this example, just remember that "set" contains the "subSET" of years that you want.
The question has been answered, but I wanted to add a comment about why your first try gives an unexpected result. This is a good example of R's vector recycling.
I'm guessing you got
year value
5 2008 4
12 2011 8
Why has R done this? What happens is R recycles the vector c("2008", "2009", "2010", "2011") like the below.
year value compare
2006 3 2008
2007 4 2009
2007 3 2010
2008 5 2011
2008 4 2008
2008 4 2009
2009 5 2010
2009 9 2011
2010 2 2008
2010 8 2009
2011 3 2010
2011 8 2011
2011 7 2008
2012 3 2009
2013 4 2010
2012 6 2011
Do you see what's about to happen? When you run
df<-df[df$year == c("2008", "2009", "2010", "2011"),]
it will return only the rows where the year column and the compare column are equal. You didn't get a warning because (by chance) the length of your comparison vector (4) evenly divides the number of rows (16), so R recycled it silently.
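The recycling can be checked directly. The sketch below rebuilds the year column from the question and compares the recycled == against an explicit rep_len(), then contrasts it with %in%. The variable names are illustrative only.

```r
year <- c(2006, 2007, 2007, 2008, 2008, 2008, 2009, 2009,
          2010, 2010, 2011, 2011, 2011, 2012, 2013, 2012)
wanted <- c(2008, 2009, 2010, 2011)

# == compares element by element, repeating `wanted` to the length of `year`
recycled <- year == wanted
explicit <- year == rep_len(wanted, length(year))
identical(recycled, explicit)  # TRUE: the two comparisons are the same

# Rows kept by the recycled comparison versus by set membership
hits_eq <- which(recycled)          # 5 12
hits_in <- which(year %in% wanted)  # 4 5 6 7 8 9 10 11 12 13

print(hits_eq)
print(hits_in)
```

If the longer length were not a multiple of the shorter one, R would still recycle but would emit a warning, which is often the only hint that == was used where %in% was intended.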
This is essentially the same as @Metrics' answer:
subset(df, year %in% c(2008, 2009, 2010, 2011))
And if you need help with %in%, see ?match, which documents it.
