Efficient Way to Build Data Frame / Data Table

I have a data.frame that I am using to set parameters for simulations.
states_grid <- expand.grid(years = c(1:47), start_pct = c(0:99), sim_num = c(1:50))
The above code creates all the states that I would like to simulate. My issue becomes creating a data.frame to hold the outputs. What I would like to do is to create a larger data frame in which we add in an ob_num variable. The ob_num variable will run from 1 to the number of years indicated in column 1.
For example:
years start_pct sim_num ob_num
1: 2 99 1 1
2: 2 99 1 2
3: 3 99 1 1
4: 3 99 1 2
5: 3 99 1 3
6: 4 99 1 1
7: 4 99 1 2
8: 4 99 1 3
9: 4 99 1 4
However, I can't think of an efficient way to create this data frame.
Thoughts?
Edit: I tried the suggestion below, but it didn't do what I need.
The code below returns a data.table of the same size (235,000 rows).
states_grid <- expand.grid(years = c(1:(year_max - year_min + 1)),
start_pct = c(0:99),
sim_num = c(1:50))
states_grid <- data.table(states_grid)
setDT(states_grid)[, ob_num := 1:.N, by = years][]
I also tried:
states_grid <- setDT(states_grid)[, ob_num := 1:.N, by = years][]
Both methods return 235K rows.

library(data.table)
# Cross join all states, then expand each state into one row per observation
CJ(years = 1:47, start_pct = 0:99, sim_num = 1:50)[,
  .(ob_num = seq_len(years)), by = .(years, start_pct, sim_num)]
# years start_pct sim_num ob_num
# 1: 1 0 1 1
# 2: 1 0 2 1
# 3: 1 0 3 1
# 4: 1 0 4 1
# 5: 1 0 5 1
# ---
#5639996: 47 99 50 43
#5639997: 47 99 50 44
#5639998: 47 99 50 45
#5639999: 47 99 50 46
#5640000: 47 99 50 47
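For context on why the attempts in the question kept 235,000 rows: := adds a column by reference and never changes the number of rows, so the table has to be expanded instead. Below is a minimal base R sketch of the same expansion. It assumes a much smaller grid than the question's (the sizes are illustrative only) so that it runs quickly, and idx and out are hypothetical names.

```r
# Small illustrative grid (assumed sizes, far smaller than the real one)
states_grid <- expand.grid(years = 1:3, start_pct = 0:1, sim_num = 1:2)

# Repeat each row index `years` times, then number the copies 1..years
idx <- rep(seq_len(nrow(states_grid)), times = states_grid$years)
out <- states_grid[idx, ]
out$ob_num <- sequence(states_grid$years)
rownames(out) <- NULL

nrow(out)  # (1 + 2 + 3) * 2 * 2 = 24 rows
```

On the full grid the same two steps should give the 5,640,000 rows shown above, though the data.table version is likely to be faster.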

Related

How do I get the value of a variable using the lag position that comes from another variable?

I am trying to get the values of a variable (B) at the lag position given by another variable (A).
The variables are something like this:
# A B
# 1: 1 10
# 2: 1 20
# 3: 1 30
# 4: 2 40
# 5: 2 50
I want the output (C) to look like this; the first value would be zero, and the condition starts in the second row:
# A B C
# 1: 1 10 0
# 2: 1 20 10
# 3: 1 30 20
# 4: 2 40 20
# 5: 2 50 30
I have done it with loops, but because there is a large amount of data, it takes a long time to run. I hope someone can give me an idea.

Here's a way with dplyr:
library(dplyr)
x %>%
mutate(
C = c(0, B[(2:n()) - A[-1]])
)
# A B C
# 1: 1 10 0
# 2: 1 20 10
# 3: 1 30 20
# 4: 2 40 20
# 5: 2 50 30
It translates directly to data.table (given the colons in your row names, I thought you might be using that package):
library(data.table)
dt = as.data.table(x)
dt[, C := c(0, B[(2:.N) - A[-1]])]
dt
# A B C
# 1: 1 10 0
# 2: 1 20 10
# 3: 1 30 20
# 4: 2 40 20
# 5: 2 50 30
Using this data:
x = read.table(text =' A B
1: 1 10
2: 1 20
3: 1 30
4: 2 40
5: 2 50', header = T)
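The index arithmetic can also be checked in base R without either package. Here is a short sketch on the same five rows: for each row i from 2 onward, the value is taken from row i - A[i]. It assumes every such source index is at least 1, since R silently drops an index of 0 and the result would then be misaligned. The names i and src are illustrative.

```r
A <- c(1, 1, 1, 2, 2)
B <- c(10, 20, 30, 40, 50)

i <- seq_along(B)[-1]      # rows 2..n
src <- i - A[-1]           # source row for each of them: 1 2 2 3
stopifnot(all(src >= 1))   # guard against look-backs past the first row
C <- c(0, B[src])
C  # 0 10 20 20 30
```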

Count next n rows that meet a condition in R

Let's say I have a df that looks like this
ID X_Value
1 40
2 13
3 75
4 83
5 64
6 43
7 74
8 45
9 54
10 84
What I would like is a rolling function: if, in the actual and last 4 rows, there are 2 or more values higher than X (let's say 70 for this example), then return 1, else 0.
So the output would be something like the following:
ID X_Value Next_4_2
1 40 0
2 13 0
3 75 0
4 83 1
5 64 1
6 43 1
7 74 1
8 45 0
9 54 0
10 84 1
I think this should be possible with a rolling function, but I have tried and am not sure how to do it. Thank you in advance.

Given your expected output, I suppose you meant "in the actual and previous 3 rows". Then using some rolling function indeed does the job:
library(zoo)
thr1 <- 70
thr2 <- 2
last <- 3 + 1
df$Next_4_2 <- 1 * (rollsum(df$X_Value > thr1, last, align = "right", fill = 0) >= thr2)
df
# ID X_Value Next_4_2
# 1 1 40 0
# 2 2 13 0
# 3 3 75 0
# 4 4 83 1
# 5 5 64 1
# 6 6 43 1
# 7 7 74 1
# 8 8 45 0
# 9 9 54 0
# 10 10 84 1

The indexing using max(1, i-3) is perhaps the only part of the code worth remembering. It might help in later work when a for-loop is really needed.
dat$X_Next_4_2 <- integer(length(dat$X_Value))
dat$X_Next_4_2[1] <- 0L
for (i in 2:length(dat$X_Value)) {
  # count values above 70 in the current row and up to 3 previous rows
  dat$X_Next_4_2[i] <- as.integer(sum(dat$X_Value[i:max(1, i - 3)] > 70) >= 2)
}
(Not very pretty and clearly inferior to the rollsum answer already posted.)
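If adding zoo is not an option, the same rolling count can be sketched in base R with cumsum. This is an alternative to the answers above, not their method, and flag, cs, and win are illustrative names. One difference to keep in mind: the first three rows here count a shorter window rather than being filled with 0, which happens to give the same result on this data.

```r
x <- c(40, 13, 75, 83, 64, 43, 74, 45, 54, 84)
k <- 4                                  # current row plus previous 3

flag <- x > 70
cs <- cumsum(flag)
# windowed count = cumulative count now minus cumulative count k rows ago
win <- cs - c(rep(0, k), head(cs, -k))
Next_4_2 <- as.integer(win >= 2)
Next_4_2  # 0 0 0 1 1 1 1 0 0 1
```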

calculate fat free mass and apply the function to data set

I have a dataset that looks like this:
ID SEX WEIGHT BMI
1 2 65 25
1 2 65 25
1 2 65 25
2 1 70 30
2 1 70 30
2 1 70 30
2 1 70 30
3 2 50 18
3 2 50 18
4 1 85 20
4 1 85 20
I want to calculate fat free mass (FFM) and attach the value in a new column in the dataset for each individual. These are the functions to calculate FFM for males and females:
for males (SEX=1):
FFMCalMale <- function (WEIGHT, BMI) {
FFM = 9270*WEIGHT/(6680+216*BMI)
}
and for females (SEX=2):
FFMCalFemale <- function(WEIGHT, BMI) {
FFM = 9270*WEIGHT/(8780+244*BMI)
}
I want to modify this function so that it checks SEX (1 is male, 2 is female), calculates FFM accordingly, and applies the calculation to each individual. Could you please help?
Thanks in advance!

You could use ifelse
data$FFM <- ifelse(data$SEX==1,
FFMCalMale(data$WEIGHT, data$BMI),
FFMCalFemale(data$WEIGHT, data$BMI))

A data.table approach:
mydata <- read.table(
header = T, con <- textConnection
('
ID SEX WEIGHT BMI
1 2 65 25
1 2 65 25
1 2 65 25
2 1 70 30
2 1 70 30
2 1 70 30
2 1 70 30
3 2 50 18
3 2 50 18
4 1 85 20
4 1 85 20
'), stringsAsFactors = FALSE)
close(con)
library(data.table) ## load data.table
setDT(mydata) ## convert the data to datatable
FFMCalMale <- function(WEIGHT, BMI) {
  9270 * WEIGHT / (6680 + 216 * BMI)
}
FFMCalFemale <- function(WEIGHT, BMI) {
  9270 * WEIGHT / (8780 + 244 * BMI)  # 244 * BMI, as in the question
}
setkey(mydata, SEX)
mydata[, FFM := ifelse(SEX == 1,
                       FFMCalMale(WEIGHT, BMI),
                       FFMCalFemale(WEIGHT, BMI))][]
# ID SEX WEIGHT BMI FFM
# 1: 2 1 70 30 49.30851
# 2: 2 1 70 30 49.30851
# 3: 2 1 70 30 49.30851
# 4: 2 1 70 30 49.30851
# 5: 4 1 85 20 71.63182
# 6: 4 1 85 20 71.63182
# 7: 1 2 65 25 40.49395
# 8: 1 2 65 25 40.49395
# 9: 1 2 65 25 40.49395
# 10: 3 2 50 18 35.18828
# 11: 3 2 50 18 35.18828

Here are two ways, one just taking the dataframe (assuming it contains columns with the names SEX, WEIGHT, and BMI):
dffunc <- function(dataframe) {
ifelse(dataframe$SEX == 1,
9270 * dataframe$WEIGHT / (6680 + 216 * dataframe$BMI),
9270 * dataframe$WEIGHT / (8780 + 244 * dataframe$BMI))
}
or as you originally formatted it, but adding the SEX parameter:
func <- function(WEIGHT, BMI, SEX) {
ifelse(SEX == 1,
9270 * WEIGHT / (6680 + 216 * BMI),
9270 * WEIGHT / (8780 + 244 * BMI))
}
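A quick usage sketch with one row per individual from the question's data. It uses the female formula exactly as written in the question (8780 + 244 * BMI), and ffm and people are illustrative names. Note that any SEX value other than 1 falls through to the female branch, so real data may need validating first.

```r
# Single vectorized function; coefficients as given in the question
ffm <- function(WEIGHT, BMI, SEX) {
  ifelse(SEX == 1,
         9270 * WEIGHT / (6680 + 216 * BMI),   # male
         9270 * WEIGHT / (8780 + 244 * BMI))   # female
}

people <- data.frame(ID = 1:4, SEX = c(2, 1, 2, 1),
                     WEIGHT = c(65, 70, 50, 85), BMI = c(25, 30, 18, 20))
people$FFM <- with(people, ffm(WEIGHT, BMI, SEX))
round(people$FFM, 2)  # 40.49 49.31 35.19 71.63
```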

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables (700+ variables) in my data and am not sure exactly how to do this, as I am still pretty new to R.
I do not want to overwrite the original values with the ranks, so I need to create a new rank variable for each of these numeric variables.
From reading other posts, I believe the assign and transform functions, along with rank, may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
The output dataset, in addition to the variables xcount, xvisit, and ysales, needs to be populated with the variables xcount_rank, xvisit_rank, and ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as intervals such as (101, 230], (230, 450], etc., whereas I would like the rank variables to be populated as 1, 2, and so on, up to 10 categories as per the splits I made. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like each record to show the number of the group it falls into, that is, the rank of its interval.

Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
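The same result is available in base R, which may be convenient because mutate_each() and funs() have since been deprecated in dplyr. Below is a sketch using the question's second data set and the xcount_rank naming the question asked for. The name cols is illustrative.

```r
input <- read.table(header = FALSE, text = "101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")

cols <- c("xcount", "xvisit", "ysales")
# labels = FALSE makes cut() return the bin number instead of the interval
input[paste(cols, "rank", sep = "_")] <-
  lapply(input[cols], cut, breaks = 3, labels = FALSE)
input$xcount_rank  # 1 1 1 1 2 1 3
```

With 700+ variables, cols could be built from names(input) instead of being typed out.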

Remove duplicate observations based on set of rules

I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to follow these rules. The variables below are id, the sex of the household head (1 = male, 2 = female), and the age of the household head. If a household has both a male and a female household head, remove the female household head observation. If a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))

You can do this by first ordering the data.frame so the desired entry for each id is first, and then remove the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45

With data.table, this is easy with "compound queries". To order the data, set the key to "id,sex" when you create the data.table (required in case any female rows come before male rows for a given id).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45
