I am using the data.table package to work with a very large data set, and I value its speed and clarity. But I am new to it and am having difficulty chaining functions together, especially when working with a mixed set of data.table and base R functions. My question is: how do I chain the example functions below into one seamless string of code that defines the target data object?
Below is the correct output, generated by running each line of code separately (unchained), with the generating code shown immediately beneath the output:
> data
ID Period State Values
1: 1 1 X0 5
2: 1 2 X1 0
3: 1 3 X2 0
4: 1 4 X1 0
5: 2 1 X0 1
6: 2 2 XX 0
7: 2 3 XX 0
8: 2 4 XX 0
9: 3 1 X2 0
10: 3 2 X1 0
11: 3 3 X9 0
12: 3 4 X3 0
13: 4 1 X2 1
14: 4 2 X1 2
15: 4 3 X9 3
16: 4 4 XX 0
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
)
# changes State to "XX" if remaining Values_1 + Values_2 cumulative sums = 0 for each ID:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID]
# create new column "Values", which equals "Values_1":
setDT(data)[,Values := Values_1]
# in base R, drops columns Values_1 and Values_2:
data <- subset(data, select = -c(Values_1,Values_2)) # How to do this step in data.table, if possible or advisable?
# in base R, changes all "XX" elements in State column to "HI":
data$State <- gsub('XX','HI', data$State) # How to do this step in data.table, if possible or advisable?
For what it's worth, below is my attempt to chain everything together using '%>%' pipe operators, which fails (error message: Error in data$State : object of type 'closure' is not subsettable), though I'd rather chain together using data.table operators:
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
) %>%
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
setDT(data)[,Values := Values_1] %>%
subset(data, select = -c(Values_1,Values_2)) %>%
data$State <- gsub('XX','HI', data$State)
If I understand correctly, the OP wants to
rename column Values_1 to Values (or, in the OP's words: create new column "Values", which equals "Values_1")
drop column Values_2
replace all occurrences of XX by HI in column State
Here is what I would do in data.table syntax:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID][
, Values_2 := NULL][
State == "XX", State := "HI"][]
setnames(data, "Values_1", "Values")
data
ID Period Values State
1: 1 1 5 X0
2: 1 2 0 X1
3: 1 3 0 X2
4: 1 4 0 X1
5: 2 1 1 X0
6: 2 2 0 HI
7: 2 3 0 HI
8: 2 4 0 HI
9: 3 1 0 X2
10: 3 2 0 X1
11: 3 3 0 X9
12: 3 4 0 X3
13: 4 1 1 X2
14: 4 2 2 X1
15: 4 3 3 X9
16: 4 4 0 HI
setnames() updates by reference, i.e., without copying. There is no need to create a copy of Values_1 and delete Values_1 later on.
Also, [State == "XX", State := "HI"] replaces XX by HI only in affected rows by reference while
[, State := gsub('XX','HI', State)] replaces the whole column.
data.table chaining is used where appropriate.
BTW: I wonder why the replacement of XX by HI cannot be done right away in the first statement:
setDT(data)[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "HI"), ID][
, Values_2 := NULL][]
setnames(data, "Values_1", "Values")
You can just chain using bracket notation [. That way you only need to call setDT() once: since all subsequent operations stay in the data.table universe, data never stops being a data.table. Also, setDT() modifies in place, so it does not need assignment (although by piping into it, its return value is assigned to data, which is fine too).
First define the data and make it a data.table:
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0", "X1", "X2", "X1", "X0", "X2", "X0", "X0", "X2", "X1", "X9", "X3", "X2", "X1", "X9", "X3")
) |>
setDT()
Then define the columns you need. Note the functional form of := used to assign several columns at once.
data[, `:=`(
State = ifelse(
rev(cumsum(rev(Values_1 + Values_2))),
State, "XX"
)
),
by = ID
][
,
`:=`(
Values = Values_1,
Values_1 = NULL,
Values_2 = NULL,
State = gsub("XX", "HI", State)
)
]
Output:
data
# ID Period State Values
# 1: 1 1 X0 5
# 2: 1 2 X1 0
# 3: 1 3 X2 0
# 4: 1 4 X1 0
# 5: 2 1 X0 1
# 6: 2 2 HI 0
# 7: 2 3 HI 0
# 8: 2 4 HI 0
# 9: 3 1 X2 0
# 10: 3 2 X1 0
# 11: 3 3 X9 0
# 12: 3 4 X3 0
# 13: 4 1 X2 1
# 14: 4 2 X1 2
# 15: 4 3 X9 3
# 16: 4 4 HI 0
You may want to read further about chaining commands in data.table (for example, in the package's introductory vignette); it is an excellent summary of the syntax and features of the package and is worth reading in its entirety.
You can use the magrittr package to chain data.tables by using . before [. Try the following code:
library(dplyr)
library(magrittr)
library(data.table)
data <-
data.frame(
ID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
Period = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
Values_1 = c(5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0),
Values_2 = c(5, 2, 0, 12, 2, 0, 0, 0, 0, 0, 0, 2, 4, 5, 6, 0),
State = c("X0","X1","X2","X1","X0","X2","X0","X0", "X2","X1","X9","X3", "X2","X1","X9","X3")
) %>%
setDT() %>%
.[, State := ifelse(rev(cumsum(rev(Values_1 + Values_2))), State, "XX"), ID] %>%
.[,Values := Values_1] %>%
select(-c(Values_1, Values_2)) %>%
mutate(State = gsub('XX','HI', State))
Output:
ID Period State Values
1: 1 1 X0 5
2: 1 2 X1 0
3: 1 3 X2 0
4: 1 4 X1 0
5: 2 1 X0 1
6: 2 2 HI 0
7: 2 3 HI 0
8: 2 4 HI 0
9: 3 1 X2 0
10: 3 2 X1 0
11: 3 3 X9 0
12: 3 4 X3 0
13: 4 1 X2 1
14: 4 2 X1 2
15: 4 3 X9 3
16: 4 4 HI 0
The data set has 3 columns: the 1st column is "id", the 2nd column is "treatment", and the 3rd column is "time". The 2nd column is a binary variable. Now, I want to extract the data by group based on the following rule.
1) Within each id, as long as the first row satisfies the condition (time = 1 and treatment = 0), we select that id's whole group of data.
To sum up, the expected data set should look like this:
id treatment time
1 0 1
1 0 2
1 0 3
1 0 4
1 0 5
1 0 6
1 0 7
1 NA 8
1 0 9
1 0 10
3 0 1
3 NA 2
3 1 3
3 1 4
3 1 5
3 1 6
3 1 7
3 NA 8
3 1 9
3 1 10
5 0 1
5 NA 2
5 0 3
5 0 4
5 0 5
5 0 6
5 0 7
5 0 8
5 0 9
5 0 10
The original data set (which contains errors) is structured as follows:
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), treatment = c(0, 0, 0, 0, 0, 0, 0, NA, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, NA, 1, 1, 1, 1, 1, NA, 1, 1,
NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0),
time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), row.names = c(NA,
50L), class = "data.frame")->dataframe
Thank you!
You can filter on the first value in each group.
library(dplyr)
dataframe %>%
group_by(id) %>%
filter(first(time) == 1 && first(treatment) == 0) %>%
ungroup
# id treatment time
# <dbl> <dbl> <dbl>
# 1 1 0 1
# 2 1 0 2
# 3 1 0 3
# 4 1 0 4
# 5 1 0 5
# 6 1 0 6
# 7 1 0 7
# 8 1 NA 8
# 9 1 0 9
#10 1 0 10
# … with 20 more rows
Another approach in base R would be to extract only the first row of each id and keep only the ids which have treatment = 0 and time = 1.
subset(dataframe, id %in% id[!duplicated(id) & treatment == 0 & time == 1])
I have an asset class return stream that is incomplete. What I have done in Frontline Solver is to generate a return distribution that matches the correlations of the asset class in question to the other asset classes over the maximum amount of data that is available. My objective function is to minimize the RMSE between that correlation matrix and the simulated correlation matrix of the entire time horizon. Constraints include setting a risk and return (mean and sd) for the asset class that is being simulated, and also some bounds on how many standard deviations each individual observation can be within.
I tried utilizing mvrnorm; however, it also re-sampled the data I used to establish the covariance matrix, which I do not want since I care about time dependency.
I started to research different optimization/solver packages such as lpSolve and quadprog, but I am having difficulty interpreting them.
Below is a data frame to use for this problem, since I can't have it be random.
data<-structure(list(Class1 = c(8, 4, 5, -3, 1, 1, 5, 0, -3, 4, 3,
-1, 2, 7, -2, 2, 5, 4, -1, 9, 2, 0, -2, 2, 2, -7, 1, 3), Class2 = c(4,
6, 4, 0, 0, -1, 5, 2, 0, 0, 0, -1, 1, 1, -1, 1, 2, 2, 0, 3, 0,
0, -4, 0, 0, -4, -2, 0), Class3 = c(6, 7, 4, -2, 1, 1, 5, 0,
-2, 4, 2, -2, 1, 6, -2, 2, 4, 4, 0, 7, 2, 0, -2, 2, 2, -6, 2,
2), Class4 = c(9, 5, 7, 0, 1, 0, 7, -2, -2, 3, 0, -2, 3, 6, 0,
2, 5, 5, 0, 7, 3, -1, -5, 1, 2, -8, 2, 2), Class5 = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, -3, 4, 4, 1, 2, 4, 4, -2,
4, 2, 0, -6, 1, 0, -9, 3, 4)), .Names = c("Class1", "Class2",
"Class3", "Class4", "Class5"), class = "data.frame", row.names = c(NA,-28L))
#Produces the following table
Class1 Class2 Class3 Class4 Class5
1 8 4 6 9 NA
2 4 6 7 5 NA
3 5 4 4 7 NA
4 -3 0 -2 0 NA
5 1 0 1 1 NA
6 1 -1 1 0 NA
7 5 5 5 7 NA
8 0 2 0 -2 NA
9 -3 0 -2 -2 NA
10 4 0 4 3 NA
11 3 0 2 0 NA
12 -1 -1 -2 -2 -3
13 2 1 1 3 4
14 7 1 6 6 4
15 -2 -1 -2 0 1
16 2 1 2 2 2
17 5 2 4 5 4
18 4 2 4 5 4
19 -1 0 0 0 -2
20 9 3 7 7 4
21 2 0 2 3 2
22 0 0 0 -1 0
23 -2 -4 -2 -5 -6
24 2 0 2 1 1
25 2 0 2 2 0
26 -7 -4 -6 -8 -9
27 1 -2 2 2 3
28 3 0 2 2 4
My goal is to get returns for Class5 for the entire 28 observations that match the correlation matrix of observations data[12:28,]. I also want to specify a custom mean and standard deviation for the simulation. For example, if you calculate the mean and sd of Class5 as it is now, you will get 0.76 and 3.8 respectively. However, I want the new data to be, let's say, 1 and 5.
Again, if you do mvrnorm using a semi-custom mu and sigma, it will also re-simulate Class1-Class4.
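For what it's worth, below is a minimal sketch of how such an objective could be set up directly in R with optim() rather than Frontline Solver. It is only an illustration of the idea, not the original Solver model: the penalty weights, the starting values and the +/- 3 sd box bounds are arbitrary assumptions. It treats all 28 Class5 values as free parameters, targets the correlation matrix of the complete rows 12:28, and adds soft penalties pulling the simulated series toward the requested mean of 1 and sd of 5; Class1-Class4 are never touched, which avoids the re-sampling problem with mvrnorm.
target_cor  <- cor(data[12:28, ])   # correlations over the complete window
target_mean <- 1                    # requested mean (from the question)
target_sd   <- 5                    # requested sd (from the question)

obj <- function(x) {
  trial <- data
  trial$Class5 <- x                 # candidate Class5 series for all 28 rows
  sim_cor <- cor(trial)             # full-horizon correlation matrix
  rmse <- sqrt(mean((sim_cor - target_cor)^2))
  # soft penalties keep the candidate near the requested mean and sd;
  # the weight of 10 is an arbitrary choice
  rmse + 10 * (mean(x) - target_mean)^2 + 10 * (sd(x) - target_sd)^2
}

set.seed(1)
start <- rnorm(nrow(data), target_mean, target_sd)   # arbitrary starting point
fit <- optim(start, obj, method = "L-BFGS-B",
             lower = target_mean - 3 * target_sd,    # example +/- 3 sd bounds
             upper = target_mean + 3 * target_sd,
             control = list(maxit = 5000))
data$Class5_sim <- fit$par
A dedicated solver (or a multi-start around optim()) may be needed for a good fit, but the structure of the objective and constraints is the same as described above.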
I would like to reclassify the names of some individuals in a data frame with consecutive letters, and the reclassification criterion has to change every X intervals since the first occurrence of an individual. Let me explain it better with an example.
ID <- c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 1, 2, 6, 8, 12, 7, 15, 16, 17, 18, 19, 20, 1, 21, 22, 19 )
Year <- c (1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6)
df <- data.frame (ID, Year)
df
I have a dataset with repeated measures of some individuals over 6 years. As you can see, some IDs, like 1 or 8, are repeated: in Year == 1, 2, 3, 4, 5 for ID == 1, and in Year == 2, 4 for ID == 8. However, different individuals may have the same ID if some time has passed since the first occurrence of an individual. This is because we consider that an individual dies every 2 years, and its ID may then be reused.
In this hypothetical case, we assume that the life of an individual is 2 years, and that during sampling we can recognise different individuals perfectly. The ID == 1 in Year == 1 and Year == 2 represents the same individual; however, ID == 1 in Years 1-2, Years 3-4 and Year 5 represents different individuals, because the individual with ID == 1 from Year == 1 couldn't live that long. The problem is that the first occurrence of an individual may happen in different years, and repeatedly, as in this case. So the code has to forget an ID every 2 years since its first occurrence and classify a new occurrence as a new individual.
I would like to name each individual with a unique ID. The new name does not have to be arranged chronologically, as you can see with ID == 1 in Year == 5. I only want each individual to be given a unique name.
Below I have put the expected result.
ID <- c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 1, 2, 6, 8, 12, 7, 15, 16, 17, 18, 19, 20, 1, 21, 22, 19 )
Year <- c (1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 1, 6, 6, 6)
new_ID <- c("A", "B", "C", "D", "E", "F", "G", "A", "B", "C", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "M", "N", "Q", "S", "L", "T", "U", "V", "W", "X", "Y", "Z", "CC", "AA", "BB", "Y")
new_df <- data.frame (ID, Year, new_ID)
new_df
As you can see, ID == 1 has a different new_ID in Year == 1, Year == 4 and Year == 5, because we assume that if an individual occurs for the first time in Year == 1, an individual with the same ID in Year == 3 is a different one, and the same goes for the individual that occurs in Year == 5.
Thanks in advance.
You can use dplyr and cut:
library(dplyr)
df %>% group_by(ID) %>%
mutate(x = as.numeric(cut(Year, seq(min(Year)-1, max(Year)+1, 2))),
idout = paste0(ID, ".", x))
ID Year x idout
1 1 1 1 1.1
2 2 1 1 2.1
3 3 1 1 3.1
4 4 1 1 4.1
5 5 1 1 5.1
6 6 1 1 6.1
7 7 1 1 7.1
8 1 2 1 1.1
9 2 2 1 2.1
10 3 2 1 3.1
11 8 2 1 8.1
12 9 2 1 9.1
13 10 2 1 10.1
14 11 2 1 11.1
15 12 2 1 12.1
16 1 3 2 1.2
17 2 3 2 2.2
18 3 3 2 3.2
19 4 3 2 4.2
20 5 3 2 5.2
21 6 3 2 6.2
22 1 4 2 1.2
23 2 4 2 2.2
24 6 4 2 6.2
25 8 4 2 8.2
26 12 4 2 12.2
27 7 5 3 7.3
28 15 5 1 15.1
29 16 5 1 16.1
30 17 5 1 17.1
31 18 5 1 18.1
32 19 5 1 19.1
33 20 5 1 20.1
34 1 5 3 1.3
35 21 6 1 21.1
36 22 6 1 22.1
37 19 6 1 19.1
NB there are two mismatches with your desired output: row 34, and rows 15 and 26, where you have an L at years 2 and 4 with the same ID. I think these are mistakes?
ID <- c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 1, 2, 6, 8, 12, 7, 15, 16, 17, 18, 19, 20, 1, 21, 22, 19 )
Year <- c (1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6)
new_ID <- c("A", "B", "C", "D", "E", "F", "G", "A", "B", "C", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "M", "N", "Q", "S", "L", "T", "U", "V", "W", "X", "Y", "Z", "CC", "AA", "BB", "Y")
new_df <- data.frame (ID, Year, new_ID)
new_df
# If every ID renews on the same 2-year cycle, use:
newID <- sapply(unique(ID), function(x) c(0, cumsum(diff(Year[ID == x])) %% 2))
# If some IDs renew in a different year, adjust the modulus per ID, e.g.:
newID <- sapply(unique(ID), function(x) {
  mod <- 2
  if (x == 1) mod <- 3
  c(0, cumsum(diff(Year[ID == x])) %% mod)
})
names(newID) <- unique(ID)
# IDcond == 0 marks the first occurrence of a (new) individual;
# non-zero values are repeat observations of the same individual
new_df <- data.frame(ID, Year, IDcond = NA, new_ID = NA)
for (i in unique(ID)) {
  new_df[new_df[, 1] == i, 3] <- newID[[which(unique(ID) == i)]]
}
# label pool: single letters first, then all two-letter combinations
ltrs <- c(LETTERS, apply(combn(LETTERS, 2, simplify = TRUE), 2,
                         function(x) paste(x, sep = "", collapse = "")))
ltrn <- 0
for (i in 1:nrow(new_df)) {
  if (new_df[i, 3] == 0) {
    # new individual: take the next unused label
    ltrn <- ltrn + 1
    new_df[i, 4] <- ltrs[ltrn]
  } else {
    # same individual: reuse its most recent label
    ind <- which(new_df[, 1] == new_df[i, 1])
    ind <- ind[ind < i]
    new_df[i, 4] <- tail(new_df[ind, 4], 1)
  }
}
new_df
> new_df
ID Year IDcond new_ID
1 1 1 0 A
2 2 1 0 B
3 3 1 0 C
4 4 1 0 D
5 5 1 0 E
6 6 1 0 F
7 7 1 0 G
8 1 2 1 A
9 2 2 1 B
10 3 2 1 C
11 8 2 0 H
12 9 2 0 I
13 10 2 0 J
14 11 2 0 K
15 12 2 0 L
16 1 3 0 M
17 2 3 0 N
18 3 3 0 O
19 4 3 0 P
20 5 3 0 Q
21 6 3 0 R
22 1 4 1 M
23 2 4 1 N
24 6 4 1 R
25 8 4 0 S
26 12 4 0 T
27 7 5 0 U
28 15 5 0 V
29 16 5 0 W
30 17 5 0 X
31 18 5 0 Y
32 19 5 0 Z
33 20 5 0 AB
34 1 5 0 AC
35 21 6 0 AD
36 22 6 0 AE
37 19 6 1 Z
I have a vector v and I want a vector w giving the weight (count) of each unique element of v. How can I get the result (vector w) in R? For example,
v = c(0, 0, 1, 1, 1, 3, 4, 4, 4, 4, 5, 5, 6)
u = unique(v)
w = c(2, 3, 1, 4, 2, 1)
Use table:
table(v)
v
0 1 3 4 5 6
2 3 1 4 2 1
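If you need w as a plain unnamed vector, as in the example above, you can drop the names that table() attaches. Since v is already sorted here, the order of table(v) matches unique(v):
w <- as.vector(table(v))  # drop the names to get a bare count vector
w
# [1] 2 3 1 4 2 1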