The problem:
I would like to construct a variable that measures cumulative work experience within a person-year longitudinal data set. The problem applies to all sorts of longitudinal data sets, and many variables might be constructed in this cumulative way (e.g., number of children, cumulative education, cumulative dollars spent on vacations, etc.).
The case:
I have a large longitudinal data set in which every row constitutes a person year. The data set contains thousands of persons (variable “ID”) followed through their lives (variable “age”), resulting in a data frame with about 1.2 million rows. One variable indicates how many months a person has worked in each person year (variable “work”). For example, when Dan was 15 years old he worked 3 months.
ID age work
1 Dan 10 0
2 Dan 11 0
3 Dan 12 0
4 Dan 13 0
5 Dan 14 0
6 Dan 15 3
7 Dan 16 5
8 Dan 17 8
9 Dan 18 5
10 Dan 19 12
11 Jeff 20 0
12 Jeff 16 0
13 Jeff 17 0
14 Jeff 18 0
15 Jeff 19 0
16 Jeff 20 0
17 Jeff 21 8
18 Jeff 22 10
19 Jeff 23 12
20 Jeff 24 12
21 Jeff 25 12
22 Jeff 26 12
23 Jeff 27 12
24 Jeff 28 12
25 Jeff 29 12
I now want to construct a cumulative work experience variable, which adds the value of year x to year x+1. The goal is to know, at each age of a person, how many months they have worked in their entire career. The variable should look like “cumwork”.
ID age work cumwork
1 Dan 10 0 0
2 Dan 11 0 0
3 Dan 12 0 0
4 Dan 13 0 0
5 Dan 14 0 0
6 Dan 15 3 3
7 Dan 16 5 8
8 Dan 17 8 16
9 Dan 18 5 21
10 Dan 19 12 33
11 Jeff 20 0 0
12 Jeff 16 0 0
13 Jeff 17 0 0
14 Jeff 18 0 0
15 Jeff 19 0 0
16 Jeff 20 0 0
17 Jeff 21 8 8
18 Jeff 22 10 18
19 Jeff 23 12 30
20 Jeff 24 12 42
21 Jeff 25 12 54
22 Jeff 26 12 66
23 Jeff 27 12 78
24 Jeff 28 12 90
25 Jeff 29 12 102
A poor solution: I can construct such a cumulative variable using the following simple loop:
# Generate test data set
x <- data.frame(
  ID = c(rep("Dan", times = 10), rep("Jeff", times = 15)),
  age = c(10:20, 16:29),
  work = c(rep(0, times = 5), 3, 5, 8, 5, 12,
           rep(0, times = 6), 8, 10, rep(12, times = 7)),
  stringsAsFactors = FALSE
)
# Generate cumulative work experience variable
x$cumwork <- x$work
for (r in 2:nrow(x)) {
  # Only carry the running total forward within the same person
  if (x$ID[r] == x$ID[r - 1]) {
    x$cumwork[r] <- x$cumwork[r - 1] + x$cumwork[r]
  }
}
However, my dataset has 1.2 million rows, so looping through each row is highly inefficient; running this loop would take hours. Does any brilliant programmer have a suggestion for constructing this cumulative measure more efficiently?
Many thanks in advance!
Best,
Raphael
ave is convenient for these types of tasks. The function you want to use with it is cumsum:
x$cumwork <- ave(x$work, x$ID, FUN = cumsum)
x
# ID age work cumwork
# 1 Dan 10 0 0
# 2 Dan 11 0 0
# 3 Dan 12 0 0
# 4 Dan 13 0 0
# 5 Dan 14 0 0
# 6 Dan 15 3 3
# 7 Dan 16 5 8
# 8 Dan 17 8 16
# 9 Dan 18 5 21
# 10 Dan 19 12 33
# 11 Jeff 20 0 0
# 12 Jeff 16 0 0
# 13 Jeff 17 0 0
# 14 Jeff 18 0 0
# 15 Jeff 19 0 0
# 16 Jeff 20 0 0
# 17 Jeff 21 8 8
# 18 Jeff 22 10 18
# 19 Jeff 23 12 30
# 20 Jeff 24 12 42
# 21 Jeff 25 12 54
# 22 Jeff 26 12 66
# 23 Jeff 27 12 78
# 24 Jeff 28 12 90
# 25 Jeff 29 12 102
However, given the scale of your data, I would also strongly suggest the "data.table" package, which also gives you access to convenient syntax:
library(data.table)
DT <- data.table(x)
DT[, cumwork := cumsum(work), by = ID]
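One point worth keeping in mind with either approach: cumsum() accumulates in row order, so the rows should be sorted by age within each person first. The following is a minimal sketch on made-up, deliberately shuffled toy data (not the poster's data set), using only base R:

```r
# Hypothetical toy data in shuffled row order
x <- data.frame(
  ID   = c("Dan", "Jeff", "Dan", "Jeff", "Dan"),
  age  = c(16, 21, 15, 22, 17),
  work = c(5, 8, 3, 10, 8),
  stringsAsFactors = FALSE
)

# Sort by person and age so the running total follows each person's timeline
x <- x[order(x$ID, x$age), ]

# Running total of months worked, restarted for every ID
x$cumwork <- ave(x$work, x$ID, FUN = cumsum)
x
```

If the data are already ordered by ID and age, as in the question, the sorting step can be skipped.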
Related
I have several months of weather data; an example day is here:
Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10
I need to figure out the total number of hours above 15 degrees by integrating in R. I'm analyzing for degree days, a concept in agriculture, that gives valuable information about relative growth rate. For example, hour 10 is 2 degree hours and hour 11 is 4 degree hours above 15 degrees. This can help predict when to harvest fruit. How can I write the code for this?
Another column could potentially work with a simple subtraction. Then I would have to make a cumulative sum after canceling out all negative numbers. That is the approach I'm setting out to do right now. Is there an integral I could write and have an answer in one step?
This solution subtracts your threshold (i.e., 15°), fits a function to the result, then integrates this function. Note that if the temperature is below the threshold, it contributes zero to the total rather than a negative value.
df <- read.table(text = "Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", header = TRUE)
with(df, integrate(approxfun(Hour, pmax(Avg.Temp-15, 0)),
lower = min(Hour), upper = max(Hour)))
#> 53.00017 with absolute error < 0.0039
Created on 2019-02-08 by the reprex package (v0.2.1.9000)
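For readers who want to see what the integral amounts to here: approxfun() interpolates linearly between the hourly points, so the area should match the trapezoidal rule applied to the clamped values. The sketch below is only an illustration of that idea, with the day's temperatures typed in by hand; the variable names are my own.

```r
hour <- 1:24
temp <- c(11, 11, 11, 10, 10, 11, 12, 14, 15, 17, 19, 21,
          22, 24, 23, 22, 21, 18, 16, 15, 14, 12, 11, 10)

# Degrees above the 15-degree threshold, with colder hours clamped to zero
excess <- pmax(temp - 15, 0)

# Trapezoidal rule: interval width times the mean of the two endpoint values
degree_hours <- sum(diff(hour) * (head(excess, -1) + tail(excess, -1)) / 2)
degree_hours

# Plain count of hourly readings strictly above the threshold
hours_above <- sum(temp > 15)
hours_above
```

The small difference from the integrate() result above is numerical error from the adaptive quadrature.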
The OP has asked how to figure out the total number of hours above 15 degrees by integrating in R.
It is not fully clear to me what the expected result is. Does the OP want to count the number of hours above 15 degrees, or does the OP want to sum up the degrees greater than 15 ("integrate")?
In any case, the code below computes both figures. Assuming the data is sampled at each hour without gaps (as suggested by the OP's sample dataset), cumsum() and sum() can be used, respectively:
library(data.table)
setDT(DT)[, c("deg_hrs_sum", "deg_hrs_cnt") :=
.(cumsum(pmax(0, Avg.Temp - 15)), cumsum(Avg.Temp > 15))]
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
1: 1 11 0 0
2: 2 11 0 0
3: 3 11 0 0
4: 4 10 0 0
5: 5 10 0 0
6: 6 11 0 0
7: 7 12 0 0
8: 8 14 0 0
9: 9 15 0 0
10: 10 17 2 1
11: 11 19 6 2
12: 12 21 12 3
13: 13 22 19 4
14: 14 24 28 5
15: 15 23 36 6
16: 16 22 43 7
17: 17 21 49 8
18: 18 18 52 9
19: 19 16 53 10
20: 20 15 53 10
21: 21 14 53 10
22: 22 12 53 10
23: 23 11 53 10
24: 24 10 53 10
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
Alternatively,
setDT(DT)[, .(deg_hrs_sum = sum(pmax(0, Avg.Temp - 15)),
deg_hrs_cnt = sum(Avg.Temp > 15))]
returns only the final result (last row):
deg_hrs_sum deg_hrs_cnt
1: 53 10
Data
library(data.table)
DT <- fread("
rn Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", drop = 1L)
So I am working with this matrix (see below) in R, where you have the individuals and the number of times they fought on the Left, on the Right, and in total. I would like to do an ANOVA to see the difference in the number of fights per individual. However, I cannot use the column with the names as it is, so I need to add it, and that's where I have a problem:
Left Right Total
DarkMale 0 1 1
Melman 5 2 7
Polp 0 12 12
Sun 10 1 11
Kevin 0 11 11
McFly 0 30 30
Lovely 36 0 36
Aquarius 0 30 30
Kenny 0 23 23
Lethabo 16 0 16
Charlie 0 3 3
Indv=rbind("DarkMale","Melman","Polp","Sun","Kevin","McFly","Lovely","Aquarius","Kenny","Lethabo","Charlie")
tab=cbind(tab,Total,Indv)
colnames(tab)=c("Left","Right","Total","Individuals")
I did this, but it then converts the rest of the table to character, which I cannot use either.
I have tried testtab=as.data.frame(tab,stringsAsFactors=FALSE)
which got rid of the "" in the table but still keeps all values as character.
How can I convert the table so that it keeps these values (see below) but stores them as integer or factor, so that I can use them for ANOVA?
Left Right Total Individuals
DarkMale 0 1 1 DarkMale
Melman 5 2 7 Melman
Polp 0 12 12 Polp
Sun 10 1 11 Sun
Kevin 0 11 11 Kevin
McFly 0 30 30 McFly
Lovely 36 0 36 Lovely
Aquarius 0 30 30 Aquarius
Kenny 0 23 23 Kenny
Lethabo 16 0 16 Lethabo
Charlie 0 3 3 Charlie
Cheers
We need to first convert to data.frame and then create a column from the row names
d1 <- transform(as.data.frame(m1), Individuals = row.names(m1))
Using cbind on a matrix together with one or more character elements converts the whole matrix to character, because a matrix can hold only a single class. If we convert to data.frame afterwards, the columns either stay character or change to factor, depending on whether stringsAsFactors is FALSE or TRUE.
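To make the coercion visible, here is a small sketch on a made-up three-row matrix (m1 and its values are placeholders, not the full table from the question):

```r
# Hypothetical numeric matrix with the individuals as row names
m1 <- matrix(c(0, 5, 0, 1, 2, 12), nrow = 3,
             dimnames = list(c("DarkMale", "Melman", "Polp"),
                             c("Left", "Right")))

# cbind() with a character vector coerces every cell to character
m_chr <- cbind(m1, Individuals = rownames(m1))
typeof(m_chr)

# A data frame stores each column separately, so the counts stay numeric
d1 <- transform(as.data.frame(m1), Individuals = row.names(m1))
d1$Total <- d1$Left + d1$Right
sapply(d1, class)
```

Whether Individuals ends up as character or factor depends on the R version's stringsAsFactors default; wrapping it in factor() makes the choice explicit if a factor is needed for the model.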
Here is another way to do it. I generated a matrix to match what you are starting with, then transformed it into a data frame. For a more compact solution, use transform as mentioned in akrun's solution.
# Numeric matrix standing in for the one in the question (values 1..33)
tab <- matrix(data = 1:33, nrow = 11, ncol = 3)
df <- as.data.frame(tab)
Indv <- c("DarkMale","Melman","Polp","Sun","Kevin","McFly","Lovely","Aquarius","Kenny","Lethabo","Charlie")
col_names <- c("Left","Right","Total","Individuals")
# Add the names as a fourth column; the numeric columns are left untouched
df[4] <- Indv
rownames(df) <- Indv
colnames(df) <- col_names
#
# Left Right Total Individuals
# DarkMale 1 12 23 DarkMale
# Melman 2 13 24 Melman
# Polp 3 14 25 Polp
# Sun 4 15 26 Sun
# Kevin 5 16 27 Kevin
# McFly 6 17 28 McFly
# Lovely 7 18 29 Lovely
# Aquarius 8 19 30 Aquarius
# Kenny 9 20 31 Kenny
# Lethabo 10 21 32 Lethabo
# Charlie 11 22 33 Charlie
How do I create a new set of data frame columns based on matched row values?
For instance, for this sample data frame:
x<-data.frame(cbind(numsp=rep(c(16,64,256),each=12),Colless=rep(c("loIc","midIc","hiIc"),each=4, times=3), lambdaE=rep(c(TRUE,FALSE),each=2,times=9),ntree=rep(c(1,2),length.out=36), metric1=seq(1:36), metric2=seq(1:36)))
For some parameter, e.g., lambdaE, I'd like to create new columns for metric1 and metric2 based on whether lambdaE is TRUE or FALSE.
The data frame would look something like this:
x2<-data.frame(cbind(numsp=rep(c(16,64,256),each=6),Colless=rep(c("hiIc","loIc","midIc"),each=2, times=3), ntree=rep(c(1,2),length.out=18), metric1.lambdE.FALSE=c(11,12,3,4,7,8,35,36,27,28,31,32,23,24,15,16,19,20), metric2.lambdE.FALSE=c(11,12,3,4,7,8,35,36,27,28,31,32,23,24,15,16,19,20),metric1.lambdE.TRUE=c(9,10,1,2,5,6,33,34,25,26,29,30,21,22,13,14,17,18), metric2.lambdE.TRUE=c(9,10,1,2,5,6,33,34,25,26,29,30,21,22,13,14,17,18)))
Or alternatively for the parameter "Colless", a new set of columns for metric1 and metric2 for each level of Colless.
Thanks in advance!
Okay, it looks like base R's reshape() function (from the stats package, no extra library needed) has a quick solution:
reshape(x, direction="wide", idvar=c("numsp","Colless","ntree"), timevar="lambdaE")
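As a rough illustration of what the wide reshape does, here is a sketch on a tiny made-up data frame (the names long, id and flag are placeholders, not columns from the question):

```r
# Hypothetical long-format data: two rows per id, one per flag value
long <- data.frame(
  id      = rep(1:3, each = 2),
  flag    = rep(c(TRUE, FALSE), times = 3),
  metric1 = 1:6,
  metric2 = 11:16
)

# One row per id; every metric gets one column per value of flag
wide <- reshape(long, direction = "wide", idvar = "id", timevar = "flag")
wide
```

The new columns are named by pasting the metric name and the flag value together with a dot, e.g. metric1.TRUE and metric1.FALSE.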
melt and dcast of reshape2 can also be used:
library(reshape2)
mm =melt(x, id=c('numsp','Colless','lambdaE','ntree'))
dcast(mm, numsp+Colless+ntree~lambdaE+variable)
numsp Colless ntree FALSE_metric1 FALSE_metric2 TRUE_metric1 TRUE_metric2
1 16 hiIc 1 11 11 9 9
2 16 hiIc 2 12 12 10 10
3 16 loIc 1 3 3 1 1
4 16 loIc 2 4 4 2 2
5 16 midIc 1 7 7 5 5
6 16 midIc 2 8 8 6 6
7 256 hiIc 1 35 35 33 33
8 256 hiIc 2 36 36 34 34
9 256 loIc 1 27 27 25 25
10 256 loIc 2 28 28 26 26
11 256 midIc 1 31 31 29 29
12 256 midIc 2 32 32 30 30
13 64 hiIc 1 23 23 21 21
14 64 hiIc 2 24 24 22 22
15 64 loIc 1 15 15 13 13
16 64 loIc 2 16 16 14 14
17 64 midIc 1 19 19 17 17
18 64 midIc 2 20 20 18 18
I have different dataframes with a column in which there are the latitudes (latitude) of some records and in another column of the same dataframe the date of the records (datecollected).
I would like to count and export in a new dataframe the number of the records in the same intervals of latitude (5 degrees) and year (two years).
(Hint: you'll make it easier for us to answer by providing some sample data.)
dataset <- data.frame(datecollected=
sample(as.Date("2000-01-01")+(0:3650),1000,replace=TRUE),
latitude=90*runif(1000))
We round the datecollected down to the nearest even year:
year.index <- (as.POSIXlt(dataset$datecollected)$year %/% 2)*2+1900
Similarly, we round the latitude down to the nearest multiple of 5 degrees:
latitude.index <- (floor(dataset$latitude) %/% 5)*5
Then we simply build a table on the rounded years and latitudes:
table(year.index,latitude.index)
latitude.index
year.index 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
2000 12 9 15 7 11 10 11 14 9 13 11 10 8 11 13 25 10 18
2002 11 9 11 16 11 15 12 5 12 13 7 15 8 7 11 7 10 13
2004 8 12 9 10 12 16 12 13 9 7 16 11 6 13 4 15 12 10
2006 14 8 13 10 12 9 12 9 6 11 11 9 13 9 10 5 5 12
2008 8 12 17 12 12 8 12 8 14 12 11 11 10 10 14 16 17 13
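To check the rounding on a few values by hand, and to get the counts as a data frame (which is what the question asks to export), something along these lines might work. The four dates and latitudes are invented for the illustration:

```r
# Hypothetical records chosen so the rounding is easy to verify by eye
dates <- as.Date(c("2001-06-15", "2002-01-01", "2003-12-31", "2004-07-04"))
lat   <- c(7.3, 12.9, 4.99, 45)

# POSIXlt years count from 1900; integer division by 2 drops odd years
# down to the even year below
year.index <- (as.POSIXlt(dates)$year %/% 2) * 2 + 1900

# Integer division by 5 gives the lower edge of each 5-degree band
latitude.index <- (lat %/% 5) * 5

# Long-format data frame: one row per (year, latitude) cell with its count
counts <- as.data.frame(table(year.index, latitude.index),
                        stringsAsFactors = FALSE)
counts
```

The resulting data frame could then be written out with write.csv(). Cells with a count of zero are included; they can be dropped with counts[counts$Freq > 0, ] if only observed combinations are wanted.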
EDIT: after a bit of discussion in the comments, I'll post my current script. It seems like there may be an issue when you read the data into R. This is what I do and what I get:
rm(list=ls())
dataset <- read.csv("GADUS.csv",header=TRUE,sep=",")
year.index <- (as.POSIXlt(as.character(dataset$datecollected),format="%Y-%m-%d")$year
%/% 2)*2+1900
latitude.index <- (floor(dataset$latitude) %/% 5)*5
table(year.index,latitude.index)
latitude.index
year.index 0 5 20 35 40 45 50 55 60 65 70 75
1752 0 0 0 0 0 20 0 0 0 0 0 0
1754 0 0 0 0 0 27 0 3 0 0 0 0
1756 0 0 0 0 0 21 0 1 0 0 0 0
1758 0 0 0 0 0 46 0 2 0 0 0 0
...
Does this give the same result for you? If not, please edit your question and post the result of str(dataset[,c("datecollected","latitude")]).
I have a dataframe with 20 classrooms (indexed 1 to 20) and a different number of students in each class. How can I obtain all sub-samples of size n = 8 and store them, because I want to use them later for calculations? I used combn(), but that takes only one vector. Can I use it with a dataframe, and how? (Sorry, but I'm new to R.)
dataframe below:
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
9 9 32
10 10 26
11 11 27
12 12 34
13 13 27
14 14 28
15 15 33
16 16 21
17 17 36
18 18 24
19 19 19
20 20 32
It is as simple as passing a function to combn. Setting simplify = FALSE means that a list will be returned.
Assuming you want all possible combinations of 8 classrooms from the data frame classrooms:
combinations <- combn(nrow(classrooms), 8, function(x,data) data[x,],
simplify = FALSE, data =classrooms )
head(combinations, n = 2)
[[1]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
8 8 22
[[2]]
classrooms students
1 1 29
2 2 30
3 3 35
4 4 28
5 5 32
6 6 20
7 7 25
9 9 32
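Once the combinations are stored as a list, later calculations can be run over it with vapply() or lapply(). Below is a minimal sketch on a cut-down, hypothetical version of the data (5 classrooms, sub-samples of size 3) so that the result is small enough to inspect:

```r
# Hypothetical reduced data set: first five classrooms only
classrooms <- data.frame(classrooms = 1:5,
                         students   = c(29, 30, 35, 28, 32))

# Every sub-sample of 3 rows, each stored as its own data frame
combos <- combn(nrow(classrooms), 3,
                function(idx) classrooms[idx, ],
                simplify = FALSE)

# Example calculation: total number of students in each sub-sample
totals <- vapply(combos, function(d) sum(d$students), numeric(1))

length(combos)
head(totals)
```

For the full problem, choose(20, 8) gives 125970 sub-samples, so keeping them all in a list is feasible but not free in memory.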