I am just hoping there is a more convenient way. Imagine I would like to run a model with different transformations of some of the columns, e.g. winsorizing. I would like to provide the transformed data set to the model, plus some additional columns that do not need to be transformed. Is there a practical way to do this in one line? I do not want to replace the data using := because I am planning to run the model with different specifications of the transformation.
library(data.table)
library(DescTools) # for Winsorize
dt <- data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10), y=rnorm(10,1,10), factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
sel.col <- c("x","y")
dt[, lapply(.SD,Winsorize), .SDcols=sel.col, by=factor]
I would need to call data.table again to merge the original dt with the transformed data and pay attention to the order.
data.table(dt[,.(id,Country),by=factor],
dt[,lapply(.SD,Winsorize),.SDcols=sel.col,by=factor])
I was hoping that I could include the additional columns in the lapply call:
dt[,.(lapply(.SD,Winsorize), id, Country),.SDcols=sel.col,by=factor]
Are there any other solutions?
Do you just need this?
dt[, c(lapply(.SD,Winsorize), list(id = id, Country = Country)), .SDcols=sel.col,by=factor]
Unfortunately this method gets slow with big data. Apparently this was optimised in a recent update, but it is still very slow.
There is no need to merge; you can assign the columns in place after the lapply call:
> library(DescTools)
> library(data.table)
> dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
> sel.col<-c("x","y")
> dt
id Country x y factor
1: 1 Germany 13.116248 -0.4609152 B
2: 2 Germany -6.623404 -3.7048052 A
3: 3 USA -18.027532 22.2946805 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -12.585897 0.8255081 B
6: 6 Germany -8.816252 -12.1218135 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 6.3262951 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 15.857069 8.6422997 A
> # Notice an assignment `(sel.col) :=` here:
> dt[,(sel.col) := lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
> dt
id Country x y factor
1: 1 Germany 11.129140 -0.4609152 B
2: 2 Germany -6.623404 -1.7234191 A
3: 3 USA -17.097573 19.5642043 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -11.831968 0.8255081 B
6: 6 Germany -8.816252 -12.0116571 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 5.2261377 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 11.581528 8.6422997 A
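If you need to keep the original dt untouched (the question mentions trying several transformation specifications), the same := idiom works on a copy. A minimal sketch (winsorized is just an illustrative name):
# Sketch: copy() keeps dt intact; each specification gets its own copy
winsorized <- copy(dt)[, (sel.col) := lapply(.SD, Winsorize), .SDcols=sel.col, by=factor]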
# Example of panel data
library(data.table)
panel<-data.table(expand.grid(Year=c(2017:2020),Individual=c("A","B","C")))
panel$value<-rnorm(nrow(panel),10) # The value I am interested in
I want to take the variance of the prior two years' values by Individual.
For example, if I were to sum the value of prior two years I would do something like:
panel[,sum_of_past_2_years:=shift(value)+shift(value, 2),Individual]
I thought this would work.
panel[,var(c(shift(value),shift(value, 2))),Individual]
# This doesn't work of course
Ideally the answer should look like
a<-c(NA,NA,var(panel$value[1:2]),var(panel$value[2:3]))
b<-c(NA,NA,var(panel$value[5:6]),var(panel$value[6:7]))
c<-c(NA,NA,var(panel$value[9:10]),var(panel$value[10:11]))
panel[,variance_past_2_years:=c(a,b,c)]
# NAs when there is no value for 2 prior years
You can use frollapply to perform a rolling operation over every 2 values.
library(data.table)
panel[, var := frollapply(shift(value), 2, var), Individual]
# Year Individual value var
# 1: 2017 A 9.416218 NA
# 2: 2018 A 8.424868 NA
# 3: 2019 A 8.743061 0.49138739
# 4: 2020 A 9.489386 0.05062333
# 5: 2017 B 10.102086 NA
# 6: 2018 B 8.674827 NA
# 7: 2019 B 10.708943 1.01853361
# 8: 2020 B 11.828768 2.06881272
# 9: 2017 C 10.124349 NA
#10: 2018 C 9.024261 NA
#11: 2019 C 10.677998 0.60509700
#12: 2020 C 10.397105 1.36742220
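Equivalently (a sketch; both formulations give the same result), you can shift the rolling result instead of the input:
# Sketch: roll first, then shift, so row i gets var(value[i-2], value[i-1])
panel[, var2 := shift(frollapply(value, 2, var)), Individual]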
I have two data.frames (call them dataset.new and dataset.old) that both contain information about some individuals. These individuals all have an identification number (a variable we can call "individual") that occurs in both of the data.frames, and each frame has information on when the data was collected, stored in a column that we can call "some.date".
The second of these two data.frames (dataset.old) contains historical data for the individuals, i.e. values of some other variables measured at other times, and thus each individual appears many times in dataset.old.
What I wish to do is the following. For each individual in dataset.new, find the rows from dataset.old that are the newest but still older than the observations in dataset.new. For the individuals that have no such date present in dataset.old, I want it to return NA.
This is perhaps easiest illustrated through some example data, presented below.
dataset.new
individual some.date
1 1 2016-05-01
2 2 2016-01-28
3 7 2016-03-03
dataset.old
individual some.date
1 1 2016-01-12
2 1 2015-12-30
3 1 2016-04-27
4 1 2016-05-02
5 2 2015-11-15
6 2 2012-01-27
7 2 2016-02-06
8 3 2016-04-30
9 3 2016-01-27
10 4 2016-03-01
11 4 2011-01-16
In this example, I am looking for a way to get the following output:
individual row.nr
1 1 3
2 2 5
3 7 NA
since those rows correspond to the newest data in dataset.old that still is older than the data in dataset.new.
I have code that solves the problem, but it is too slow for the data I have in mind (well over 20,000 rows in dataset.new and many, many more in dataset.old). My solution is basically a loop over all individuals, subsetting the data at each stage.
find.previous <- function(dataset.old, individual, some.new.date){
  # We only look at the individual in question.
  subsetted.dataset <- dataset.old[dataset.old[, "individual"] == individual, ]
  # Keep only the rows with data measured BEFORE the new time point.
  subsetted.dataset <- subsetted.dataset[subsetted.dataset[, "some.date"] < some.new.date, ]
  # The smallest positive difference is the newest date that is still older.
  row.index <- which.min(some.new.date - subsetted.dataset[, "some.date"])
  # Output the row number of that row, or NA if no such row exists.
  ifelse(length(row.index) != 0, as.integer(rownames(subsetted.dataset[row.index, ])), NA)
}
output <- matrix(ncol=2, nrow=0)
for(i in 1:nrow(dataset.new)){
  output <- rbind(output, cbind(dataset.new[, "individual"][i],
                                find.previous(dataset.old, dataset.new[, "individual"][i],
                                              dataset.new[, "some.date"][i])))
}
colnames(output) <- c("individual", "row.nr")
output
Any help on how to solve this problem would be greatly appreciated. I have tried using my Google skills as well as reading other posts here on Stack Overflow, but without success.
The example data can be replicated by copying the following lines of code:
dataset.new <- data.frame(individual=c(1, 2, 7), some.date=as.Date(c("2016-05-01", "2016-01-28", "2016-03-03")))
dataset.old <- data.frame(individual=c(1,1,1,1,2,2,2,3,3,4,4), some.date=as.Date(c("2016-01-12", "2015-12-30", "2016-04-27", "2016-05-02", "2015-11-15", "2012-01-27", "2016-02-06", "2016-04-30", "2016-01-27", "2016-03-01", "2011-01-16")))
You can solve this efficiently with a merge.
First make the rownumber variable you want in dataset.old. Then merge dataset.new with dataset.old on individual (left join, or merge(lhs, rhs, all.x = TRUE)). This can get you:
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
4 1 2016-05-01 2016-05-02 4
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
7 2 2016-01-28 2016-02-06 7
8 7 2016-03-03 NA NA
Subset to new.date > old.date or is.na(old.date):
dataset.old
individual new.date old.date old.rownumber
1 1 2016-05-01 2016-01-12 1
2 1 2016-05-01 2015-12-30 2
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
6 2 2016-01-28 2012-01-27 6
8 7 2016-03-03 NA NA
Subset to old.date == max(old.date) or is.na(old.date) grouped by individual.
dataset.old
individual new.date old.date old.rownumber
3 1 2016-05-01 2016-04-27 3
5 2 2016-01-28 2015-11-15 5
8 7 2016-03-03 NA NA
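Putting the three steps together in base R (a sketch, assuming the example data above; m and res are just illustrative names):
# Step 1: record the original row numbers
dataset.old$old.rownumber <- seq_len(nrow(dataset.old))
# Step 2: left join on individual; merge suffixes .x/.y become new/old
m <- merge(dataset.new, dataset.old, by = "individual", all.x = TRUE)
names(m)[names(m) == "some.date.x"] <- "new.date"
names(m)[names(m) == "some.date.y"] <- "old.date"
m <- m[m$new.date > m$old.date | is.na(m$old.date), ]
# Step 3: keep the newest remaining old.date per individual (NA rows pass through)
res <- do.call(rbind, lapply(split(m, m$individual), function(d)
  d[is.na(d$old.date) | d$old.date == max(d$old.date), ]))
res[, c("individual", "old.rownumber")]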
Edit:
I'm partial to data.table. The code would look something like:
dataset.old[, old.rownumber := 1:.N]
setnames(dataset.old, "some.date", "old.date")
setnames(dataset.new, "some.date", "new.date")
# all.y = TRUE keeps individuals that appear only in dataset.new (e.g. individual 7)
dataset.merge <- merge(dataset.old, dataset.new, by = "individual", all.y = TRUE)
dataset.merge <- dataset.merge[new.date > old.date | is.na(old.date)]
dataset.merge[, .SD[old.date == max(old.date) | is.na(old.date)], by = individual]
We can skip the NA handling by taking the minimum of the square roots of the date differences. The negative differences are coerced to NaN, which which.min() ignores:
library(dplyr)
dataset.old$rn <- 1:nrow(dataset.old)
# square roots of negative differences are NaN, which which.min() skips
minp <- function(x) if(!length(m <- which.min(as.numeric(x)^.5))) NA else m
mrg <- merge(dataset.new, dataset.old, by="individual", all.x=TRUE)
mrg %>% group_by(individual) %>%
  summarise(row.nr=rn[minp(some.date.x - some.date.y)])
# A tibble: 3 x 2
# individual row.nr
# <int> <int>
# 1 1 3
# 2 2 5
# 3 7 NA
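To see why the square-root trick works, a minimal illustration: a negative difference becomes NaN under ^.5, and which.min() skips NaN values:
as.numeric(c(-3, 1, 5))^.5             # NaN 1.000000 2.236068 (with a NaNs-produced warning)
which.min(as.numeric(c(-3, 1, 5))^.5)  # 2 -- the smallest non-negative difference wins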
My data is available at monthly frequency and I'm trying to aggregate it to quarterly frequency. I'm working with the data.table package, which, to be honest, I don't understand very well.
X.DATA_BASE NOME_INSTITUICAO SALDO.x SALDO.y
1: 199407 ASB S/A - CFI 1694581 1124580
2: 199407 BANCO ARAUCARIA S.A. 40079517 6314782
3: 199407 BANCO ATLANTIS S.A. 200463907 9356445
4: 199407 BANCO BANKPAR 1078342 5770046
5: 199407 BANCO BBI 97812975 31112289
Each date is given by X.DATA_BASE, e.g. 199407 = July 1994. I have several institutions with SALDO.x and SALDO.y values. I want to sum SALDO.x and SALDO.y for each institution in each quarter. One problem is that some institutions enter and exit over time. At the end of the day I want my data to have the same columns but at quarterly frequency.
How could I do that?
Here's an example of how to group and sum by quarter (with thanks to @eddi for his suggested improvement). First let's create some fake data:
library(data.table)
set.seed(1485)
dat = data.table(date=rep(c(199401:199412,199501:199512),2),
firm=rep(c("A","B"), each=24),
value1=rnorm(48,1000,10),
value2=rnorm(48,2000,100))
dat
date firm value1 value2
1: 199401 A 1009.8620 2054.251
2: 199402 A 1009.7180 2124.202
3: 199403 A 1014.3421 1919.251
...
46: 199510 B 992.9961 2079.517
47: 199511 B 997.9147 1968.676
48: 199512 B 1002.5993 2006.231
Now, summarize by firm, year, and quarter. To do this, we create year and quarter grouping variables from date (we use integer division (%/%) to create the years and mod (%%) plus integer division to create the quarters), and calculate the sum of value1 and value2 for each sub-group. This all assumes date is numeric. If you have it stored as character or factor, convert to numeric first:
dat.summary = dat[ , list(valueByQuarter = sum(value1) + sum(value2)),
by=list(firm,
year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1)]
dat.summary
firm year quarter valueByQuarter
1: A 1994 1 9131.626
2: A 1994 2 8953.116
3: A 1994 3 8981.407
4: A 1994 4 9175.959
5: A 1995 1 9003.225
6: A 1995 2 8962.690
7: A 1995 3 8809.256
8: A 1995 4 8885.264
9: B 1994 1 9000.791
10: B 1994 2 8936.356
11: B 1994 3 8905.789
12: B 1994 4 8951.369
13: B 1995 1 8922.716
14: B 1995 2 9097.134
15: B 1995 3 8724.188
16: B 1995 4 9047.934
For dplyr fans, here's a dplyr approach:
library(dplyr)
dat %>%
group_by(firm, year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1) %>%
summarise(valueByQuarter = sum(value1 + value2))
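Applied to the question's data, if you want SALDO.x and SALDO.y kept as separate quarterly columns, the same year/quarter trick should work. A sketch (mydata and the column names are taken from the question; untested against the real data):
library(data.table)
# Sketch: sum each SALDO column separately per institution and quarter
setDT(mydata)[, .(SALDO.x = sum(SALDO.x), SALDO.y = sum(SALDO.y)),
              by = .(NOME_INSTITUICAO,
                     year = X.DATA_BASE %/% 100,
                     quarter = (X.DATA_BASE %% 100 - 1) %/% 3 + 1)]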
> head(df)
person week target actual drop_out organization agency
1: QJ1 1 30 19 TRUE BB LLC
2: GJ2 1 30 18 FALSE BB LLC
3: LJ3 1 30 22 TRUE CC BBR
4: MJ4 1 30 24 FALSE CC BBR
5: PJ5 1 35 55 FALSE AA FUN
6: EJ6 1 35 50 FALSE AA FUN
There are around ~30 weeks in the dataset with a repeating Person ID each week.
I want to look at each person's values FOUR weeks at a time (so weeks 1-4, 5-8, 9-12, and so on). For each of these chunks, I want to add up all the "actual" values and divide by the sum of the "target" values. Then we could put that value in a column called "monthly percent."
As per Shape's recommendation I've created a month column like so
fullReshapedDT$month <- with(fullReshapedDT, ceiling(week/4))
Trying to figure out how to iterate over the month column and calculate averages now. Trying something like this, but it obviously doesn't work:
fullReshapedDT[,.(monthly_attendance = actual/target,by=.(person_id, month)]
Have you tried creating a group variable? It will allow you to group operations by the four-week period:
setDT(df1)[,grps:=ceiling(week/4) #Create 4-week groups
][,sum(actual)/sum(target), .(person, grps) #grouped operations
][,grps:=NULL][] #Remove unnecessary columns
# person V1
# 1: QJ1 1.1076923
# 2: GJ2 1.1128205
# 3: LJ3 0.9948718
# 4: MJ4 0.6333333
# 5: PJ5 1.2410256
# 6: EJ6 1.0263158
# 7: QJ1 1.2108108
# 8: GJ2 0.6378378
# 9: LJ3 0.9891892
# 10: MJ4 0.8564103
# 11: PJ5 1.1729730
# 12: EJ6 0.8666667
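If you want the value as a new column on the original rows, as the question's "monthly percent" suggests, := with the same grouping should work. A sketch (monthly_percent is the question's suggested name):
# Sketch: broadcasts each 4-week group's ratio back onto its rows
setDT(df1)[, monthly_percent := sum(actual)/sum(target), by=.(person, grps=ceiling(week/4))]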
I have a data set A
paper_id author_id
1 521630
1 1611750
2 9
3 627950
4 1456512
8 15
........
and a data set B
author_id author_name author_affiliation
9 Ernest Jordan Cambridge
14 K. MORIBE NA
15 D. Jakominich NA
25 William H. Nailon
37 P. B. Littlewood Cavendish Laboratory|Cambridge University
........
I want to merge these two data sets on author_id, but the result should look like:
paper id author_id author_name author_affiliation
2 9 Ernest Jordan Cambridge
8 15 D. Jakominich NA
That is, I want the merge performed on author_id, but the result ordered by paper_id, so that the paper_id order doesn't get disturbed.
What I am doing is:
b <- merge(A, B, by="author_id")
and I am getting the following, in which the paper_id order is disturbed:
author_id paper_id author_name author_affiliation
9 1468598 Ernest Jordan cambridge
9 1682105 Ernest Jordan cambridge
and then I have to sort this output by the paper_id column, which is very inefficient.
How could this be done?
Thanks
This should do what you want.
b <-merge(A,B,by="author_id", sort=F)
b <- b[,c(2,1,3,4)]
You can turn off sorting on the by=... columns with sort=F, but merge(...) will always make the sort columns the first columns of the result. The last line of code just swaps columns 1 and 2.
EDIT (Response to #BrianDiggs comment)
#BrianDiggs is correct that, while sort=F will not force a sort on the by=... column, it does not guarantee the original sort order in A. If efficiency is a big concern, then consider the data.table package, which was built for this:
# create an example
A <- data.frame(paper_id=1:10000, author_id=rev(LETTERS[1:4]))
B <- data.frame(author_id=LETTERS[1:4],
author_name=c("Davies","Hawking","Carlyle","Higgs"),
author_affiliation=c("Oxford","Cambridge","UCL","Edinburgh"),
stringsAsFactors=F)
library(data.table)
A <- data.table(A,key="author_id")
B <- data.table(B,key="author_id")
A[B,c("author_name","author_affiliation"):=list(author_name,author_affiliation)]
setkey(A,paper_id)
head(A)
# paper_id author_id author_name author_affiliation
# 1: 1 D Higgs Edinburgh
# 2: 2 C Carlyle UCL
# 3: 3 B Hawking Cambridge
# 4: 4 A Davies Oxford
# 5: 5 D Higgs Edinburgh
# 6: 6 C Carlyle UCL
Unlike sort(...), setting a key in a data table sorts "by reference" using a radix algorithm. Sorting by reference means that the rows are rearranged in memory instead of copying the whole table into a new table. As a result, sorting data tables is extremely fast and memory efficient.
Also, the use of A[B,...] to do the merge is much faster than merging two data frames. In addition, this process appends the new columns to A (rather than creating a copy of A, as merge(...) does).
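On recent data.table versions you can also skip setting keys and join with on= directly; a sketch of the same in-place update:
# Sketch: the i. prefix refers to columns of B (the table in i)
A[B, on="author_id", c("author_name","author_affiliation") := .(i.author_name, i.author_affiliation)]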
If you can consider non-base alternatives, then you may try the plyr equivalent of merge: join. From "Details" in ?join: "Unlike merge, preserves the order of x no matter what join type is used." Also the order of columns is preserved.
library(plyr)
join(A, B, type = "inner")
# Joining by: author_id
# paper_id author_id author_name author_affiliation
# 1 2 9 Ernest_Jordan Cambridge
# 2 8 15 D._Jakominich <NA>
inner_join in dplyr is similar. However, while the order of columns in x is kept, the columns in y seem to be sorted alphabetically:
library(dplyr)
inner_join(x = A, y = B)
# Joining by: "author_id"
# paper_id author_id author_affiliation author_name
# 1 2 9 Cambridge Ernest_Jordan
# 2 8 15 <NA> D._Jakominich
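If the original column order matters, a select() call restores it; a sketch:
inner_join(x = A, y = B) %>%
  select(paper_id, author_id, author_name, author_affiliation)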
Too long for a comment
When I run the merge, I do get the output you want:
A <- read.table(text="paper_id author_id
1 521630
1 1611750
2 9
3 627950
4 1456512
8 15", header=T)
B <- read.table(text="author_id author_name author_affiliation
9 Ernest_Jordan Cambridge
14 K._MORIBE NA
15 D._Jakominich NA
25 William_H._Nailon NA
37 P._B._Littlewood Cavendish_Laboratory|Cambridge_University",
header=T)
b <- merge(A, B, by="author_id")
b
# author_id paper_id author_name author_affiliation
# 1 9 2 Ernest_Jordan Cambridge
# 2 15 8 D._Jakominich <NA>
Can you clarify your problem?