Counting the unique values of two combined columns [duplicate] - r

I have a data.table as follows:
library(data.table)
library(haven)
df1 <- fread(
"A B C iso year
0 B 1 NLD 2009
1 A 2 NLD 2009
0 Y 3 AUS 2011
1 Q 4 AUS 2011
0 NA 7 NLD 2008
1 0 1 NLD 2008
0 1 3 AUS 2012",
header = TRUE
)
I want to count the unique combinations of iso and year (here NLD 2009, AUS 2011, NLD 2008 and AUS 2012, so 4).
I tried df1[,uniqueN(.(iso, year))] and df1[,uniqueN(c("iso", "year"))]
The first one gives an error, and the second one returns 2, whereas I am looking for the 4 unique combinations.
What am I doing wrong here?
(As I am doing this with a big dataset of strings, I would prefer not to combine the columns first and then test.)

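uniqueN(c("iso", "year")) counts the two strings "iso" and "year" themselves, which is why it returns 2. uniqueN expects an atomic vector, a data.frame or a data.table, so the plain list built by .(iso, year) is rejected. You can solve it as follows using the data.table package: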
df1[, uniqueN(.SD), .SDcols=c("iso", "year")]
or
uniqueN(df1, by=c("iso", "year"))
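
As a minimal sketch of why the attempts in the question behave as described, the block below rebuilds only the iso and year columns by hand (instead of reading them with fread) and compares the calls. The object name dt_example and the result names are assumptions made for illustration.

```r
library(data.table)

# Hypothetical stand-in for df1: only the two columns of interest.
dt_example <- data.table(
  iso  = c("NLD", "NLD", "AUS", "AUS", "NLD", "NLD", "AUS"),
  year = c(2009, 2009, 2011, 2011, 2008, 2008, 2012)
)

# A character vector of column names: counts the two strings themselves.
n_names <- uniqueN(c("iso", "year"))

# The columns themselves, passed as a data.table: counts unique rows.
n_by <- uniqueN(dt_example, by = c("iso", "year"))
n_sd <- dt_example[, uniqueN(.SD), .SDcols = c("iso", "year")]

print(c(names = n_names, by = n_by, sd = n_sd))
```

Neither of the two working calls pastes the columns together, which is what the question wanted to avoid on a large table of strings.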

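As an alternative to the data.table approach, count from dplyr lists each combination together with its number of rows; the number of rows in the result (4) is the number of unique combinations: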
library(dplyr)
df1 %>% count(iso, year)
Output:
iso year n
1: AUS 2011 2
2: AUS 2012 1
3: NLD 2008 2
4: NLD 2009 2
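
If only the single number is needed, a possible dplyr sketch is to keep the distinct combinations and count the rows. The data frame example_df is a hand-typed stand-in for df1 (an assumption for illustration), restricted to the two relevant columns.

```r
library(dplyr)

# Hypothetical stand-in for df1: only iso and year.
example_df <- data.frame(
  iso  = c("NLD", "NLD", "AUS", "AUS", "NLD", "NLD", "AUS"),
  year = c(2009, 2009, 2011, 2011, 2008, 2008, 2012)
)

# Keep one row per (iso, year) combination, then count the rows.
n_combinations <- example_df %>%
  distinct(iso, year) %>%
  nrow()

print(n_combinations)
```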

Related

How to add a column by matching with previous year?

I have a data frame as follows:
df1
ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007
I want to add a third column (let's call it infoyear) to my df that holds, for each row, the year before the one in closingdate.
In other words, I want to get the following result:
df1
ID closingdate infoyear
1 31/12/2005 2004
2 01/12/2009 2008
3 02/01/2002 2001
4 09/10/2000 1999
5 15/11/2007 2006
Normally, to add a column presenting only years I would use:
library(data.table)
setDT(df1)[, infoyear := year(as.IDate(closingdate, '%d/%m/%Y'))]
In my case that would give me the following:
df1
ID closingdate infoyear
1 31/12/2005 2005
2 01/12/2009 2009
3 02/01/2002 2002
4 09/10/2000 2000
5 15/11/2007 2007
But instead of this result for column infoyear, I would like the year before closingdate (as in the result shown earlier).
How can I solve a problem like this in R? Thank you!
You can try:
df1 <- read.table(header=TRUE, text="ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007")
setDT(df1)
df1[, closingdate:= as.IDate(closingdate,"%d/%m/%Y")]
df1[, infoyear:= year(closingdate)-1]
df1
#Output
ID closingdate infoyear
1: 1 2005-12-31 2004
2: 2 2009-12-01 2008
3: 3 2002-01-02 2001
4: 4 2000-10-09 1999
5: 5 2007-11-15 2006
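
For comparison, here is a sketch of the same idea in base R, without data.table: parse the date, extract the year as an integer, and subtract one. The object name example_df is an assumption for illustration.

```r
# Hypothetical copy of the question's data.
example_df <- data.frame(
  ID = 1:5,
  closingdate = c("31/12/2005", "01/12/2009", "02/01/2002",
                  "09/10/2000", "15/11/2007")
)

# Parse day/month/year strings into Date objects.
closing <- as.Date(example_df$closingdate, format = "%d/%m/%Y")

# Extract the four-digit year and step back one year.
example_df$infoyear <- as.integer(format(closing, "%Y")) - 1L

print(example_df)
```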

R: Count number of new observations compared to previous groups

I would like to know the number of new observations that occurred between groups.
If I have the following data:
Year Observation
2009 A
2009 A
2009 B
2010 A
2010 B
2010 C
I would like the output to be
Year New_Observation_Count
2009 2
2010 1
I am new to R and don't really know how to move forward. I have tried using the count function in the tidyverse package but still can't figure it out.
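You can use union in Reduce to accumulate the set of observations seen so far, then take the differences between the sizes of successive sets:
y <- split(x$Observation, x$Year)
data.frame(Year = names(y), nNew =
  diff(lengths(Reduce(union, y, NULL, accumulate = TRUE))))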
# Year nNew
#1 2009 2
#2 2010 1
Data:
x <- read.table(header=TRUE, text="Year Observation
2009 A
2009 A
2009 B
2010 A
2010 B
2010 C")
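
To make the Reduce step easier to follow, the sketch below spells out the intermediate objects. The names seen, sizes, n_new and n_new_alt are assumptions for illustration. The second approach, based on duplicated, is a swapped-in alternative that assumes the rows are already sorted by Year.

```r
x <- data.frame(
  Year = c(2009, 2009, 2009, 2010, 2010, 2010),
  Observation = c("A", "A", "B", "A", "B", "C")
)

# Observations grouped by year.
y <- split(x$Observation, x$Year)

# Running union of everything seen so far, starting from the empty set (NULL).
seen <- Reduce(union, y, NULL, accumulate = TRUE)

# Size of each cumulative set; the first entry is the empty starting set.
sizes <- lengths(seen)

# New observations per year = growth of the cumulative set.
n_new <- diff(sizes)

# Alternative (assumes rows sorted by Year): count first occurrences per year.
n_new_alt <- tapply(!duplicated(x$Observation), x$Year, sum)

print(data.frame(Year = names(y), nNew = n_new))
```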

How to repeat observations n times in R? [duplicate]

I am working on incorporating a variable that is recorded once per unit to a yearly dataset. While it is quite straightforward to repeat the observations n times, I have trouble assigning years to the observations.
The structure of my data is as follows:
id startyear endyear dummy
1 1946 2005 1
2 1957 2005 1
3 1982 2005 1
4 1973 2005 1
What I want to do is to create a new column, called year, and repeat each unit once per year it covers: unit 1 gets 2005 - 1946 + 1 = 60 rows (both endpoints included), unit 2 gets 2005 - 1957 + 1 = 49 rows, and so forth, with the year assigned to each row, generating the following output:
id startyear endyear dummy year
1 1946 2005 1 1946
1 1946 2005 1 1947
1 1946 2005 1 1948
1 1946 2005 1 1949
[…]
I have attempted to use slice and mutate in dplyr, in combination with rep and seq but neither gives me the result I want. Any help would be greatly appreciated.
We can use map2 to build, for each row, the sequence from 'startyear' to 'endyear' as a list column, and then unnest it:
library(tidyverse)
df1 %>%
  mutate(year = map2(startyear, endyear, `:`)) %>%
  unnest(year)
# id startyear endyear dummy year
#1 1 1946 2005 1 1946
#2 1 1946 2005 1 1947
#3 1 1946 2005 1 1948
#4 1 1946 2005 1 1949
#5 1 1946 2005 1 1950
#6 1 1946 2005 1 1951
#7 1 1946 2005 1 1952
#...
Or group by 'id', mutate the sequence into a list column and unnest:
df1 %>%
  group_by(id) %>%
  mutate(year = list(startyear:endyear)) %>%
  unnest(year)
A less elegant alternative, almost as simple:
library(tidyverse)
df1 %>%
  uncount(endyear - startyear + 1, .id = "row") %>%
  mutate(year = startyear + row - 1)
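
The same expansion can also be sketched in base R, assuming a data frame shaped like the one in the question: repeat each row once per year in its range, then attach the concatenated year sequences. The names df_example, n_years and expanded are assumptions for illustration.

```r
# Hypothetical copy of the question's data.
df_example <- data.frame(
  id = 1:4,
  startyear = c(1946, 1957, 1982, 1973),
  endyear = 2005,
  dummy = 1
)

# Number of rows each unit needs, counting both endpoints.
n_years <- df_example$endyear - df_example$startyear + 1

# Repeat row i n_years[i] times.
expanded <- df_example[rep(seq_len(nrow(df_example)), times = n_years), ]

# One sequence of years per unit, concatenated in the same row order.
expanded$year <- unlist(Map(seq, df_example$startyear, df_example$endyear))
rownames(expanded) <- NULL

head(expanded)
```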

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
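
For readers without data.table, a single-call sketch with base R's reshape gives the same wide layout, though the column order differs. The object names df_long and wide are assumptions for illustration. All columns other than idvar and timevar are treated as values to spread, and the new names are built as value.Country.

```r
df_long <- read.table(header = TRUE, text = "Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8")

# One row per Year; VarA and VarB are spread across the values of Country.
wide <- reshape(df_long, idvar = "Year", timevar = "Country", direction = "wide")

# Sort by Year for readability.
wide <- wide[order(wide$Year), ]

print(wide)
```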

Subsetting a data.table using another data.table

I have the dt and dt1 data.tables.
dt<-data.table(id=c(rep(2, 3), rep(4, 2)), year=c(2005:2007, 2005:2006), event=c(1,0,0,0,1))
dt1<-data.table(id=rep(2, 5), year=c(2005:2009), performance=(1000:1004))
dt
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
4: 4 2005 0
5: 4 2006 1
dt1
id year performance
1: 2 2005 1000
2: 2 2006 1001
3: 2 2007 1002
4: 2 2008 1003
5: 2 2009 1004
I would like to subset the former using the combination of its first and second column that also appear in dt1. As a result of this, I would like to create a new object without overwriting dt. This is what I'd like to obtain.
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
I tried to do this using the following code:
dt.sub<-dt[dt[,c(1:2)] %in% dt1[,c(1:2)],]
but it didn't work: I got back a data table identical to dt. I think there are at least two mistakes in my code. The first is that I am probably selecting the columns of the data.table with the wrong method. The second, and more evident, is that %in% applies to vectors and not to multiple-column objects. Nevertheless, I am unable to find a more efficient way to do it...
Thank you in advance for your help!
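Set the same key on both tables and join them; nomatch=0 drops the rows of dt1 that have no match in dt. Note that the result also carries the performance column from dt1.
setkeyv(dt,c('id','year'))
setkeyv(dt1,c('id','year'))
dt[dt1,nomatch=0]
Output: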
> dt[dt1,nomatch=0]
id year event performance
1: 2 2005 1 1000
2: 2 2006 0 1001
3: 2 2007 0 1002
Use merge:
merge(dt,dt1, by=c("year","id"))
year id event performance
1: 2005 2 1 1000
2: 2006 2 0 1001
3: 2007 2 0 1002
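
Both answers above return the performance column as well, while the desired result has only the columns of dt. One possible sketch, assuming a data.table version that supports the on= argument and nomatch = NULL, is to join on just the unique (id, year) pairs of dt1. The helper name keys is an assumption for illustration. Because no key is set, dt itself is left unchanged.

```r
library(data.table)

dt  <- data.table(id = c(rep(2, 3), rep(4, 2)),
                  year = c(2005:2007, 2005:2006),
                  event = c(1, 0, 0, 0, 1))
dt1 <- data.table(id = rep(2, 5), year = 2005:2009, performance = 1000:1004)

# Unique (id, year) pairs of dt1, without its other columns.
keys <- unique(dt1[, .(id, year)])

# Keep the rows of dt whose (id, year) pair appears in keys;
# nomatch = NULL drops pairs from keys that are absent from dt.
dt.sub <- dt[keys, on = .(id, year), nomatch = NULL]

print(dt.sub)
```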
