Change name of column after uniqueN function - r

I am already happy with the results, but I want to tidy up my data further by giving the result column a proper name.
The task is to count the number of distinct authors for each year of publication between 2000 and 2010. Here is my code and my result:
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000, uniqueN(Book_Author), by = "Year_Of_Publication"][order(Year_Of_Publication)]
Year_Of_Publication V1
1: 2000 12057
2: 2001 11818
3: 2002 11942
4: 2003 9913
5: 2004 4536
6: 2005 38
7: 2006 3
8: 2008 1
9: 2010 2
The numbers in the result are right, but I want to change the column name V1 to something like "Num_Of_Dif_Auth". I tried the setnames function, but as I don't want to change the underlying dataset, it didn't help.

You can name the column directly in j by wrapping the expression in .():
library(data.table)
books_dt[Year_Of_Publication <= 2010 & Year_Of_Publication >= 2000,
         .(Num_Of_Dif_Auth = uniqueN(Book_Author)),
         by = Year_Of_Publication][order(Year_Of_Publication)]
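A small self-contained sketch of the same idea, using a made-up books_dt (the toy rows are an assumption, not the asker's data). It also shows that calling setnames() on the result of the query renames only that new table, since the aggregation returns a fresh data.table. keyby is used here so the result comes back grouped and sorted by year (it also sets a key on the result).

```r
library(data.table)

# Hypothetical toy data standing in for the asker's books_dt
books_dt <- data.table(
  Year_Of_Publication = c(2000, 2000, 2000, 2001, 2001, 1999),
  Book_Author         = c("A", "B", "A", "C", "C", "D")
)

# Option 1: name the column inside .(); keyby groups and sorts by year
res1 <- books_dt[Year_Of_Publication >= 2000 & Year_Of_Publication <= 2010,
                 .(Num_Of_Dif_Auth = uniqueN(Book_Author)),
                 keyby = Year_Of_Publication]

# Option 2: rename V1 afterwards; only the aggregated result is modified
res2 <- books_dt[Year_Of_Publication >= 2000 & Year_Of_Publication <= 2010,
                 uniqueN(Book_Author),
                 keyby = Year_Of_Publication]
setnames(res2, "V1", "Num_Of_Dif_Auth")

names(books_dt)  # still "Year_Of_Publication" "Book_Author"
```

Either option leaves books_dt untouched; the first is usually preferable because the name is set where the value is computed.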

Related

Unusual Behaviour of colon operator : in R

2000:2017
The expected output is a vector of the sequence 2000 to 2017 with a step of 1.
Output: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
'2000':'2017'
However, when I type this command, it still gives me the same output.
Output: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
I am unable to understand how it generates a sequence from character strings.
Edit 1:
Ultimately, I am trying to understand why the code below works. How can X2007:X2011 possibly work? The select function is from the dplyr package.
[Image: R code]
My data also has similar column names as mentioned in the image above but I do not have 'X' there. I just have years like 2007,2008 etc.
For me select(Division, State, 2007:2011) does not work.
Error:Can't subset columns that don't exist.
x Locations 2007, 2008, 2009, 2010, and 2011 don't exist.
But this works select(Division, State, '2007':'2011').
If we check the more generic seq.default, it does change the type of from and to from character to numeric:
...
if (!missing(from) && !is.finite(if (is.character(from)) from <- as.numeric(from) else from))
stop("'from' must be a finite number")
if (!missing(to) && !is.finite(if (is.character(to)) to <- as.numeric(to) else to))
...
Along the same lines, the documentation of ?`:` also says so:
For other arguments from:to is equivalent to seq(from, to), and generates a sequence from from to to in steps of 1 or -1. Value to will be included if it differs from from by an integer up to a numeric fuzz of about 1e-7. Non-numeric arguments are coerced internally (hence without dispatching methods) to numeric—complex values will have their imaginary parts discarded with a warning.
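A quick way to check the coercion described above, as a minimal sketch in base R. The variable names x and msg are only illustrative.

```r
# Character arguments are coerced to numeric before the sequence is built
x <- "2000":"2017"
all(x == 2000:2017)   # TRUE
length(x)             # 18

# Strings that cannot be parsed as numbers become NA, so `:` fails.
# suppressWarnings() hides the "NAs introduced by coercion" warnings
# so that only the error message is captured.
msg <- tryCatch(suppressWarnings("a":"c"),
                error = function(e) conditionMessage(e))
msg                   # "NA/NaN argument" in an English locale
```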
Regarding the updated question about subset and select: a column name that starts with a digit is a non-standard (non-syntactic) name, and such names can be referenced by wrapping them in backquotes.
df1 <- data.frame(`2007` = 1:5, `2008` = 6:10,
`2012` = 11:15, v1 = rnorm(5), check.names = FALSE)
subset(df1, select = `2007`:`2012`)
# 2007 2008 2012
#1 1 6 11
#2 2 7 12
#3 3 8 13
#4 4 9 14
#5 5 10 15
Or with dplyr::select
library(dplyr)
select(df1, `2007`:`2012`)
# 2007 2008 2012
#1 1 6 11
#2 2 7 12
#3 3 8 13
#4 4 9 14
#5 5 10 15
If the names have X at the beginning (this happens when the data is read without check.names = FALSE, since the default is TRUE, or when the dataset is created with data.frame, where check.names = TRUE is also the default):
df1 <- data.frame(`2007` = 1:5, `2008` = 6:10, `2012` = 11:15, v1 = rnorm(5))
subset(df1, select = X2007:X2012)
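To see where the X prefix comes from, here is a short base R sketch. The object names df_checked and df_raw are made up for the illustration; make.names() is the function that check.names = TRUE applies to the column names.

```r
# make.names() turns non-syntactic names into syntactic ones
make.names(c("2007", "2008", "v1"))   # "X2007" "X2008" "v1"

df_checked <- data.frame(`2007` = 1:5, `2008` = 6:10, `2012` = 11:15)
df_raw     <- data.frame(`2007` = 1:5, `2008` = 6:10, `2012` = 11:15,
                         check.names = FALSE)

names(df_checked)   # "X2007" "X2008" "X2012"
names(df_raw)       # "2007" "2008" "2012"

# Range selection by name works in both cases; only the quoting differs
a <- subset(df_checked, select = X2007:X2012)
b <- subset(df_raw, select = `2007`:`2012`)
```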
From what I know, : tries to coerce its arguments to numeric, which is why you got that output. Note that "a":"c" gives:
Error in "a":"c" : NA/NaN argument
In addition: Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion

Bridge the last and next non-NA value with intermediate values that grow evenly

What would be a good way to fill the missing NAs in a dataframe column with intermediate values that grow gradually from the last non-NA value to the next non-NA value?
Here is an example: for the column cost, I would like to obtain the column cost_esti, where the cost increases by about $31 each year between 2014 and 2016, bridging the last known cost of $595 and the next known cost of $720.
The code I came up with is lengthy. Is there an elegant way to do the same?
library(data.table)
data = data.table(year=2000:2018,
cost = c(100,120,NA,200,220,NA,NA,300,350,470,500,NA,NA,595,NA,NA,NA,720,800))
data[,cost_nas:=as.numeric(is.na(cost))]
## consecutive nas so far for each row:
data[, consecutive_nas_so_far := seq_len(.N), by=rleid(cost_nas)]
data[cost_nas==0,consecutive_nas_so_far:=0]
# total number of consecutive nas in the sequence
data[,total_number_of_consec_nas:=ifelse(consecutive_nas_so_far>0&shift(consecutive_nas_so_far,1,type = "lead")==0,consecutive_nas_so_far,NA)]
data[cost_nas==0,total_number_of_consec_nas:=0]
data[,total_number_of_consec_nas:=zoo::na.locf(total_number_of_consec_nas,fromLast=T)]
#get last and next known values for cost:
data[,cost_previous:=zoo::na.locf(cost)]
data[,cost_following:=zoo::na.locf(cost,fromLast=T)]
# apply the formula to calculate the gradual increase from cost_previous to cost_following
data[,cost_esti:=round(consecutive_nas_so_far*(cost_following-cost_previous)/(total_number_of_consec_nas+1)+cost_previous,0)]
data[is.na(cost_esti),cost_esti:=cost]
You can simplify the data.table operations using zoo::na.locf and data.table::rleid. Add two columns, lastNonNA and nextNonNA, using na.locf. rleid gives each run of consecutive NAs its own group. You can then fill the NAs by linear interpolation between lastNonNA and nextNonNA.
library(data.table)
library(zoo)
#Data
data = data.table(year=2000:2018,
cost = c(100,120,NA,200,220,NA,NA,300,350,470,500,NA,NA,595,NA,NA,NA,720,800))
data[, ':='(lastNonNA = na.locf(cost, fromLast = FALSE),
            nextNonNA = na.locf(cost, fromLast = TRUE),
            Group_NA  = rleid(is.na(cost)))][
  , ':='(IDX = 1:.N), by = Group_NA][
  , ':='(cost = ifelse(is.na(cost), lastNonNA + IDX * ((nextNonNA - lastNonNA) / (.N + 1)), cost)),
  by = Group_NA][, .(year, cost)]
# year cost
# 1: 2000 100.0000
# 2: 2001 120.0000
# 3: 2002 160.0000 #Filled
# 4: 2003 200.0000
# 5: 2004 220.0000
# 6: 2005 246.6667 #Filled
# 7: 2006 273.3333 #Filled
# 8: 2007 300.0000
# 9: 2008 350.0000
# 10: 2009 470.0000
# 11: 2010 500.0000
# 12: 2011 531.6667 #Filled
# 13: 2012 563.3333 #Filled
# 14: 2013 595.0000
# 15: 2014 626.2500 #Filled
# 16: 2015 657.5000 #Filled
# 17: 2016 688.7500 #Filled
# 18: 2017 720.0000
# 19: 2018 800.0000
What you are asking for in the question is a linear interpolation.
It can be obtained quite easily in R for your data with NAs.
In this case the solution would be:
library("imputeTS")
na_interpolation(data, option = "linear")
You could also use option = "spline" or option = "stine"; the increase would then not necessarily be strictly linear.
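If adding a package is not an option, the same linear interpolation can be sketched with approx() from base R's stats package. This is an alternative to the answers above, not a reproduction of them. It assumes the first and last values of cost are known; with the default settings, leading or trailing NAs would stay NA.

```r
library(data.table)

data <- data.table(
  year = 2000:2018,
  cost = c(100, 120, NA, 200, 220, NA, NA, 300, 350, 470, 500,
           NA, NA, 595, NA, NA, NA, 720, 800)
)

# approx() interpolates linearly between the known (year, cost) points;
# known values are returned unchanged
data[, cost_esti := approx(x = year[!is.na(cost)],
                           y = cost[!is.na(cost)],
                           xout = year)$y]

# 2014 to 2016 are filled with 626.25, 657.50 and 688.75
data[year %in% 2013:2017, .(year, cost, cost_esti)]
```

Wrapping the result in round() would give whole-dollar values like those in the question.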

R data.table Conditional Sum: Cleaner way

This is surely a very common problem, so I expected to find many questions about it here on SO. However, all the answers I could find were very specific to the question and often involved workarounds ("you don't have to do this, foobar is much better in this scenario") or non-data.table solutions. Perhaps this is because it should be a no-brainer with data.table.
I have a data.table which contains yearly data on tentgelt and te_med. For each year, I want to know the share of observations for which tentgelt > te_med. This is what I am doing:
# note that nAbove and nBelow do not add up to 1
nAbove <- wages[tentgelt > te_med, list(nAbove = .N), by=list(year)]
nBelow <- wages[tentgelt < te_med, list(nBelow = .N), by=list(year)]
nBelow[nAbove][, list(year, foo=nAbove/(nAbove+nBelow))]
which works but whenever I see other people's data.table code, it looks much clearer and easier than my workarounds. Is there a cleaner way to get the following type of output?
year foo
1: 1993 0.2372093
2: 1994 0.1567568
3: 1995 0.8132530
4: 1996 0.1235955
5: 1997 0.1065574
6: 1998 0.3070684
7: 1999 0.1491974
Here's a sample of my data:
year tentgelt te_med
1: 2010 120.95 53.64929
2: 2010 9.99 116.72601
3: 2010 113.52 53.07394
4: 2010 10.27 38.45728
5: 2010 48.58 124.65753
6: 2010 96.38 86.99060
7: 2010 3.46 65.75342
8: 2010 107.52 91.87592
9: 2010 107.52 42.92953
10: 2010 3.46 73.92328
11: 2010 96.38 85.23419
12: 2010 2.25 79.19995
13: 2010 42.32 35.75757
14: 2010 7.94 93.44305
15: 2010 120.95 113.41370
16: 2010 7.94 110.68628
17: 2010 107.52 127.30682
18: 2010 2.25 103.49036
19: 2010 120.95 123.62054
20: 2010 96.38 68.57532
For this sample, the expected output should be:
year V2
1: 2010 0.45
Try this
wages[, list(foo= sum(tentgelt > te_med)/.N), by = year]
# year foo
# 1: 2010 0.45
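The sketch below reruns this on the 20 sample rows from the question. Since a logical vector averages to the share of TRUE values, mean(tentgelt > te_med) is an equivalent spelling. One difference from the question's code is worth noting: the question divides by nAbove + nBelow, which leaves out rows where tentgelt equals te_med, whereas dividing by .N counts them. The second expression is one assumed way to reproduce the question's denominator; the names share_all and share_no_ties are illustrative.

```r
library(data.table)

# The 20 sample rows from the question
wages <- data.table(
  year = 2010L,
  tentgelt = c(120.95, 9.99, 113.52, 10.27, 48.58, 96.38, 3.46, 107.52, 107.52, 3.46,
               96.38, 2.25, 42.32, 7.94, 120.95, 7.94, 107.52, 2.25, 120.95, 96.38),
  te_med = c(53.64929, 116.72601, 53.07394, 38.45728, 124.65753, 86.99060, 65.75342,
             91.87592, 42.92953, 73.92328, 85.23419, 79.19995, 35.75757, 93.44305,
             113.41370, 110.68628, 127.30682, 103.49036, 123.62054, 68.57532)
)

# Share above the median, with ties counted in the denominator
share_all <- wages[, .(foo = mean(tentgelt > te_med)), by = year]

# Share above the median, ignoring ties (matches nAbove / (nAbove + nBelow))
share_no_ties <- wages[, .(foo = sum(tentgelt > te_med) / sum(tentgelt != te_med)),
                       by = year]
```

The sample contains no ties, so both versions give 0.45 for 2010.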

efficient date comparison in data table

I have a data frame (actually a data table) that looks like
id hire.date survey.year
1 15-04-2003 2003
2 16-07-2001 2001
3 06-06-1980 2002
4 17-08-1981 2001
I need to check whether hire.date is earlier than, say, 31 March of survey.year. So I would end up with something like
id hire.date survey.year emp31mar
1 15-04-2003 2003 FALSE
2 16-07-2001 2001 FALSE
3 06-06-1980 2002 TRUE
4 17-08-1981 2001 TRUE
I could always create an object holding March 31st of survey.year and then make the appropriate comparison like so
mar31 = as.Date(paste0("31-03-", as.character(myData$survey.year)), "%d-%m-%Y")
myData$emp31mar = as.Date(myData$hire.date, "%d-%m-%Y") < mar31
but creating the object mar31 is consuming too much time because myData is large-ish (think tens of millions of rows).
I wonder if there is a more efficient way of doing this -- a way that doesn't involve creating an object such as mar31?
You could try the data.table methods for creating the column.
library(data.table)
setDT(df1)[, emp31mar:= as.Date(hire.date, '%d-%m-%Y') <
paste(survey.year, '03-31', sep="-")][]
# id hire.date survey.year emp31mar
#1: 1 15-04-2003 2003 FALSE
#2: 2 16-07-2001 2001 FALSE
#3: 3 06-06-1980 2002 TRUE
#4: 4 17-08-1981 2001 TRUE
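Since survey.year usually takes only a few distinct values, another option is to build the 31 March cutoff once per distinct year and look it up for each row, rather than pasting and parsing a string for every row. The sketch below assumes the four sample rows from the question; yrs and cutoffs are illustrative helper names. Whether it is actually faster on tens of millions of rows would need to be timed on the real data.

```r
library(data.table)

df1 <- data.table(
  id = 1:4,
  hire.date = c("15-04-2003", "16-07-2001", "06-06-1980", "17-08-1981"),
  survey.year = c(2003L, 2001L, 2002L, 2001L)
)

# Build one cutoff date per distinct survey year (a handful of values)
yrs <- sort(unique(df1$survey.year))
cutoffs <- as.Date(paste0(yrs, "-03-31"))

# Look up each row's cutoff by position instead of parsing a string per row
df1[, emp31mar := as.Date(hire.date, format = "%d-%m-%Y") <
                    cutoffs[match(survey.year, yrs)]]

df1$emp31mar   # FALSE FALSE TRUE TRUE
```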

Subsetting a data.table using another data.table

I have the dt and dt1 data.tables.
dt<-data.table(id=c(rep(2, 3), rep(4, 2)), year=c(2005:2007, 2005:2006), event=c(1,0,0,0,1))
dt1<-data.table(id=rep(2, 5), year=c(2005:2009), performance=(1000:1004))
dt
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
4: 4 2005 0
5: 4 2006 1
dt1
id year performance
1: 2 2005 1000
2: 2 2006 1001
3: 2 2007 1002
4: 2 2008 1003
5: 2 2009 1004
I would like to subset the former using the combination of its first and second column that also appear in dt1. As a result of this, I would like to create a new object without overwriting dt. This is what I'd like to obtain.
id year event
1: 2 2005 1
2: 2 2006 0
3: 2 2007 0
I tried to do this using the following code:
dt.sub<-dt[dt[,c(1:2)] %in% dt1[,c(1:2)],]
but it didn't work. As a result, I got back a data table identical to dt. I think there are at least two mistakes in my code. The first is that I am probably subsetting the data.table by column using the wrong method. The second, and pretty evident, is that %in% applies to vectors and not to multiple-column objects. Nevertheless, I am unable to find a more efficient way to do it...
Thank you in advance for your help!
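Set the same key on both tables, then join; nomatch=0 drops the rows of dt1 that have no match in dt:
setkeyv(dt,c('id','year'))
setkeyv(dt1,c('id','year'))
dt[dt1,nomatch=0]
Output: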
> dt[dt1,nomatch=0]
id year event performance
1: 2 2005 1 1000
2: 2 2006 0 1001
3: 2 2007 0 1002
Use merge:
merge(dt,dt1, by=c("year","id"))
year id event performance
1: 2005 2 1 1000
2: 2006 2 0 1001
3: 2007 2 0 1002
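Both answers above also carry over the performance column, which the requested output does not contain. A sketch of one way to keep only the columns of dt is to join on just the (id, year) pairs of dt1 using on=, which also avoids setting keys on either table. The unique() call is an added precaution against duplicated pairs in dt1, and nomatch = NULL (equivalent to nomatch = 0, available in recent data.table versions) drops pairs that do not occur in dt.

```r
library(data.table)

dt  <- data.table(id = c(rep(2, 3), rep(4, 2)),
                  year = c(2005:2007, 2005:2006),
                  event = c(1, 0, 0, 0, 1))
dt1 <- data.table(id = rep(2, 5), year = 2005:2009, performance = 1000:1004)

# Join on the (id, year) pairs of dt1 only, so no extra columns are carried over;
# nomatch = NULL drops pairs that do not occur in dt
dt.sub <- dt[unique(dt1[, .(id, year)]), on = .(id, year), nomatch = NULL]

names(dt.sub)   # "id" "year" "event"
nrow(dt)        # 5, dt itself is unchanged
```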

Resources