Appending a tally of subjects to a dataframe in R

I have a list of subjects:
myDat = list(Subject = c(10234, 10234, 10234, 10234, 10242, 10242, 10242, 10242, 10253, 10253, 10253, 10268, 10268, 10268, 10268))
and I would like to add a count column (DayNo) to the data frame that restarts at 1 with every change in subject.
Thanks in advance

An ave variant:
df <- as.data.frame(myDat)
df$Day <- ave(df$Subject, df$Subject, FUN=seq_along)
Produces:
Subject Day
1 10234 1
2 10234 2
3 10234 3
4 10234 4
5 10242 1
6 10242 2
7 10242 3
8 10242 4
9 10253 1
10 10253 2
11 10253 3
12 10268 1
13 10268 2
14 10268 3
15 10268 4

Use rle to get the run lengths and use sequence to create sequences of corresponding length.
myDat <- as.data.frame(myDat)
myDat$DayNo <- sequence(rle(myDat$Subject)$lengths)
# Subject DayNo
# 1 10234 1
# 2 10234 2
# 3 10234 3
# 4 10234 4
# 5 10242 1
# 6 10242 2
# 7 10242 3
# 8 10242 4
# 9 10253 1
# 10 10253 2
# 11 10253 3
# 12 10268 1
# 13 10268 2
# 14 10268 3
# 15 10268 4
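The two answers agree on the data above because each subject's rows are contiguous. A small sketch of where they differ, using a made-up vector (subj is hypothetical, not from the question) in which a subject reappears later:

```r
# Hypothetical input: subject 1 shows up again after subject 2
subj <- c(1, 1, 2, 2, 1)

# ave() groups by value, so the count continues across the gap
by_value <- ave(subj, subj, FUN = seq_along)
by_value
# [1] 1 2 1 2 3

# rle() works on runs, so the count restarts at every change in subject
by_run <- sequence(rle(subj)$lengths)
by_run
# [1] 1 2 1 2 1
```

If "restarts with every change in subject" is meant literally, the rle version is the one that matches; if each subject should get a single running count, use ave.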

Related

Simulate unbalanced clustered data

I want to simulate some unbalanced clustered data. The number of clusters is 20 and the average number of observations per cluster is 30. I would first like to generate each cluster with 10% more observations than specified (i.e., 33 rather than 30), and then randomly exclude the appropriate number of observations (i.e., 60) to arrive at the specified average of 30 per cluster. The probability of excluding an observation should not be uniform across clusters (i.e., some clusters have no cases removed and others have more removed), so in the end I still have 600 observations in total. Does anyone know how to do that in R? Here is a smaller example dataset. The number of observations per cluster doesn't follow the condition above; I just use it to convey the idea.
> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> df <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> df
id cluster x y
1 1 1 0.30003855 0.65325768
2 2 1 -1.00563626 -0.12270866
3 3 1 0.01925927 -0.41367651
4 4 1 -1.07742065 -2.64314895
5 5 1 0.71270333 -0.09294102
6 1 2 1.08477509 0.43028470
7 2 2 -2.22498770 0.53539884
8 3 2 1.23569346 -0.55527835
9 4 2 -1.24104450 1.77950291
10 5 2 0.45476927 0.28642442
11 1 3 0.65990264 0.12631586
12 2 3 -0.19988983 1.27226678
13 3 3 -0.64511396 -0.71846622
14 4 3 0.16532102 -0.45033862
15 5 3 0.43881870 2.39745248
16 1 4 0.88330282 0.01112919
17 2 4 -2.05233698 1.63356842
18 3 4 -1.63637927 -1.43850664
19 4 4 1.43040234 -0.19051680
20 5 4 1.04662885 0.37842390
After randomly adding and deleting some data, the unbalanced data become like this:
id cluster x y
1 1 1 0.895 -0.659
2 2 1 -0.160 -0.366
3 1 2 -0.528 -0.294
4 2 2 -0.919 0.362
5 3 2 -0.901 -0.467
6 1 3 0.275 0.134
7 2 3 0.423 0.534
8 3 3 0.929 -0.953
9 4 3 1.67 0.668
10 5 3 0.286 0.0872
11 1 4 -0.373 -0.109
12 2 4 0.289 0.299
13 3 4 -1.43 -0.677
14 4 4 -0.884 1.70
15 5 4 1.12 0.386
16 1 5 -0.723 0.247
17 2 5 0.463 -2.59
18 3 5 0.234 0.893
19 4 5 -0.313 -1.96
20 5 5 0.848 -0.0613
EDIT
This part of the problem is solved (credit goes to jay.sf). Next, I want to repeat this process 1000 times and run a regression on each generated dataset. However, I don't want to run the regression on the whole dataset but rather on a random selection of clusters (this can be done with: df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ]). In the end, I would like to get confidence intervals from those 1000 regressions. How should I proceed?
As per Ben Bolker's request, I am posting my solution but see jay.sf for a more generalizable answer.
#First create an oversampled dataset:
y <- rnorm(24)
x <- rnorm(24)
z <- rep(1:6, 4)
w <- rep(1:4, each=6)
df <- data.frame(id=z,cluster=w,x=x,y=y)
#Then just slice_sample to arrive at the sample size as desired
library(dplyr) # provides %>%, slice_sample() and arrange()
df %>% slice_sample(n = 20) %>%
  arrange(cluster)
#Or just use base R
a <- df[sample(nrow(df), 20), ]
df2 <- a[order(a$cluster), ]
Let ncl be the desired number of clusters. We may generate a sampling space S, which is a sequence spanning the tolerance tol around the mean number of observations per cluster mnobs. From that we repeatedly draw a random sample of size 1 for each cluster to obtain a list of clusters CL. If the sum of cluster lengths equals ncl*mnobs we break the loop, add random data to the clusters and rbind the result.
FUN <- function(ncl=20, mnobs=30, tol=.1) {
  S <- do.call(seq.int, as.list(mnobs*(1 + tol*c(-1, 1))))
  repeat({
    CL <- lapply(1:ncl, function(x) rep(x, sample(S, 1, replace=T)))
    if (sum(lengths(CL)) == ncl*mnobs) break
  })
  L <- lapply(seq.int(CL), function(i) {
    id <- seq.int(CL[[i]])
    cbind(id, cluster=i,
          matrix(rnorm(max(id)*2),,2, dimnames=list(NULL, c("x", "y"))))
  })
  do.call(rbind.data.frame, L)
}
Usage
set.seed(42)
res <- FUN() ## using defined `arg` defaults
dim(res)
# [1] 600 4
(res.tab <- table(res$cluster))
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 29 29 31 31 30 32 31 30 32 28 28 27 28 31 32 33 31 30 27 30
table(res.tab)
# 27 28 29 30 31 32 33
# 2 3 2 4 5 3 1
sapply(c("mean", "sd"), function(x) do.call(x, list(res.tab)))
# mean sd
# 30.000000 1.747178
Displayable example
set.seed(42)
FUN(4, 5, tol=.3) ## tol needs to be adjusted for smaller samples
# id cluster x y
# 1 1 1 1.51152200 -0.0627141
# 2 2 1 -0.09465904 1.3048697
# 3 3 1 2.01842371 2.2866454
# 4 1 2 -1.38886070 -2.4404669
# 5 2 2 -0.27878877 1.3201133
# 6 3 2 -0.13332134 -0.3066386
# 7 4 2 0.63595040 -1.7813084
# 8 5 2 -0.28425292 -0.1719174
# 9 6 2 -2.65645542 1.2146747
# 10 1 3 1.89519346 -0.6399949
# 11 2 3 -0.43046913 0.4554501
# 12 3 3 -0.25726938 0.7048373
# 13 4 3 -1.76316309 1.0351035
# 14 5 3 0.46009735 -0.6089264
# 15 1 4 0.50495512 0.2059986
# 16 2 4 -1.71700868 -0.3610573
# 17 3 4 -0.78445901 0.7581632
# 18 4 4 -0.85090759 -0.7267048
# 19 5 4 -2.41420765 -1.3682810
# 20 6 4 0.03612261 0.4328180
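The follow-up in the EDIT (repeat 1000 times, resample clusters, collect confidence intervals) is not covered above. A minimal sketch of one way it could go, under assumptions that are mine and not from the question: simulate_clusters() is a hypothetical stand-in for the generator, the true slope of 0.5 is arbitrary, and the interval is a simple percentile interval over the 1000 slope estimates.

```r
set.seed(123)

# Hypothetical generator: 20 clusters with unequal sizes that sum to 600.
# The slope of 0.5 is an arbitrary choice for illustration only.
simulate_clusters <- function() {
  sizes <- sample(c(rep(27, 5), rep(30, 10), rep(33, 5)))  # shuffled sizes
  dat <- data.frame(cluster = rep(seq_along(sizes), sizes),
                    x = rnorm(sum(sizes)))
  dat$y <- 0.5 * dat$x + rnorm(nrow(dat))
  dat
}

# One replication: simulate, resample whole clusters with replacement, fit lm
one_rep <- function() {
  dat <- simulate_clusters()
  rows_by_cluster <- split(seq_len(nrow(dat)), dat$cluster)
  k <- length(rows_by_cluster)
  picked <- sample.int(k, k, replace = TRUE)
  boot <- dat[unlist(rows_by_cluster[picked], use.names = FALSE), ]
  coef(lm(y ~ x, data = boot))[["x"]]
}

slopes <- replicate(1000, one_rep())

# Percentile interval over the 1000 slope estimates
ci <- quantile(slopes, c(0.025, 0.975))
ci
```

Whether a percentile interval is the right kind of interval for your design is a separate question; the sketch only shows the mechanics of looping, resampling clusters and collecting a coefficient.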

Problem in changing a matrix to a data frame with the same dimensions

I have tried to create a data frame from a matrix; however, the result has different dimensions compared to the original matrix. Please see my code below:
out <- table(UL_Final$Issue_Year, UL_Final$Insured_Age_Group)
out <- out/rowSums(out) #changing all numbers to ratio
The result is a matrix 12 by 7:
1 2 3 4 5 6 7
1387 0.165137615 0.036697248 0.229357798 0.321100917 0.201834862 0.018348624 0.027522936
1388 0.149222065 0.110325318 0.197312588 0.342291372 0.136492221 0.055162659 0.009193777
1389 0.144979508 0.101946721 0.222848361 0.335553279 0.138575820 0.046362705 0.009733607
1390 0.146991622 0.120030465 0.191622239 0.336024372 0.142269612 0.052551409 0.010510282
1391 0.165462754 0.111794582 0.185835214 0.321049661 0.135553047 0.064503386 0.015801354
1392 0.162399144 0.109583402 0.165321917 0.317388441 0.146344476 0.076115594 0.022847028
1393 0.181602139 0.116447173 0.151104070 0.325131201 0.148628577 0.062778493 0.014308347
1394 0.163760504 0.098529412 0.142489496 0.323792017 0.178728992 0.076050420 0.016649160
1395 0.137097032 0.094699511 0.128981757 0.321320170 0.197610147 0.098245950 0.022045433
1396 0.167187958 0.103851041 0.112696706 0.293202033 0.200689082 0.099306031 0.023067149
1397 0.193250090 0.130540713 0.108114843 0.270743930 0.186411584 0.091364656 0.019574185
1398 0.208026156 0.147573562 0.100455157 0.249503173 0.191935380 0.083338676 0.019167895
then using the code below:
out <- data.frame(out)
However, the result is a data frame with dimensions 84 by 3:
Var1 Var2 Freq
1 1387 1 0.165137615
2 1388 1 0.149222065
3 1389 1 0.144979508
4 1390 1 0.146991622
5 .... .......
I am not sure why this happens. However, in another case, described below, I do not see this behavior. There, I used the code below to calculate another ratio for another variable:
out <- table( df_select$Insured_Age_Group,df_select$Policy_Status)
out <- cbind(out, Ratio = out[,2]/rowSums(out))
the result is :
Issuance Surrended Ratio
1 31046 5735 0.1559229
2 20039 4409 0.1803420
3 20399 9228 0.3114726
4 48677 17216 0.2612721
5 30045 8132 0.2130078
6 13947 4106 0.2274414
7 3157 1047 0.2490485
Now if we use the code below (by @Ronak Shah):
out <- data.frame(out) %>% mutate(x = row_number())
the result is :
Issuance Surrended Ratio x
1 31046 5735 0.1559229 1
2 20039 4409 0.1803420 2
3 20399 9228 0.3114726 3
4 48677 17216 0.2612721 4
5 30045 8132 0.2130078 5
6 13947 4106 0.2274414 6
7 3157 1047 0.2490485 7
As you can see, the result is now a data frame with the same dimensions. Can anyone explain why this happens?
See ?table for an explanation:
The as.data.frame method for objects inheriting from class "table" can be used to convert the array-based representation of a contingency table to a data frame containing the classifying factors and the corresponding entries (the latter as component named by responseName). This is the inverse of xtabs.
A workaround is to use as.data.frame.matrix:
m <- table(mtcars$carb, mtcars$gear)
as.data.frame(m)
# Var1 Var2 Freq
# 1 1 3 3
# 2 2 3 4
# 3 3 3 3
# 4 4 3 5
# 5 6 3 0
# 6 8 3 0
# 7 1 4 4
# 8 2 4 4
# 9 3 4 0
# 10 4 4 4
# 11 6 4 0
# 12 8 4 0
# 13 1 5 0
# 14 2 5 2
# 15 3 5 0
# 16 4 5 1
# 17 6 5 1
# 18 8 5 1
as.data.frame.matrix(m)
# 3 4 5
# 1 3 4 0
# 2 4 4 2
# 3 3 0 0
# 4 5 4 1
# 6 0 0 1
# 8 0 0 1
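This also explains the two cases in the question: dividing a table by its row sums keeps the "table" class, so data.frame() uses the table method and returns the long format, whereas cbind() returns a plain matrix, which converts column by column. A short sketch with mtcars (the object names are only for illustration):

```r
tab  <- table(mtcars$carb, mtcars$gear)
prop <- tab / rowSums(tab)          # arithmetic keeps the "table" class

long <- as.data.frame(prop)         # table method: one row per cell
wide <- as.data.frame.matrix(prop)  # matrix method: same shape as the table

dim(long)
# [1] 18  3
dim(wide)
# [1] 6 3

# cbind() drops the "table" class, so the result is an ordinary matrix
bound <- cbind(tab, total = rowSums(tab))
inherits(bound, "table")
# [1] FALSE
dim(as.data.frame(bound))
# [1] 6 4
```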

Convert year-month string to three month bins with gaps - how to assign contiguous ascending values?

I have used the code below to "bin" a year.month string into three month bins. The problem is that I want each of the bins to have a number that corresponds where the bin occurs chronologically (i.e. first bin =1, second bin=2, etc.). Right now, the first month bin is assigned to the number 4, and I am not sure why. Any help would be highly appreciated!
> head(Master.feed.parts.gn$yr.mo, n=20)
[1] "2007.10" "2007.10" "2007.10" "2007.11" "2007.11" "2007.11" "2007.11" "2007.12" "2008.01"
[10] "2008.01" "2008.01" "2008.01" "2008.01" "2008.02" "2008.03" "2008.03" "2008.03" "2008.04"
[19] "2008.04" "2008.04"
>
> yearmonth_to_integer <- function(xx) {
+ yy_mm <- as.integer(unlist(strsplit(xx, '.', fixed=T)))
+ return( (yy_mm[1] - 2006) + (yy_mm[2] %/% 3) )
+ }
>
> Cluster.GN <- sapply(Master.feed.parts.gn$yr.mo, yearmonth_to_integer)
> Cluster.GN
2007.10 2007.10 2007.10 2007.11 2007.11 2007.11 2007.11 2007.12 2008.01 2008.01 2008.01
4 4 4 4 4 4 4 5 2 2 2
2008.01 2008.01 2008.02 2008.03 2008.03 2008.03 2008.04 2008.04 2008.04 2008.04 2008.05
2 2 2 3 3 3 3 3 3 3 3
2008.05 2008.05 2008.06 2008.10 2008.11 2008.11 2008.12 <NA> 2009.05 2009.05 2009.05
3 3 4 5 5 5 6 NA 4 4 4
2009.06 2009.07 2009.07 2009.07 2009.09 2009.10 2009.11 2010.01 2010.02 2010.02 2010.02
5 5 5 5 6 6 6 4 4 4 4
UPDATE:
I was asked to provide sample input (year) and the desired output (Cluster.GN). I have a year-month string with varying numbers of observations for each month, and some months have no observations at all. What I want to do is bin each set of three consecutive months that have data, assigning each three-month "bin" a number as shown below.
yr.mo Cluster.GN
1 2007.10 1
2 2007.10 1
3 2007.10 1
4 2007.10 1
5 2007.10 1
6 2007.11 1
7 2007.11 1
8 2007.11 1
9 2007.11 1
10 2007.12 1
11 2007.12 1
12 2007.12 1
13 2007.12 1
14 2008.10 2
15 2008.10 2
16 2008.10 2
17 2008.10 2
18 2008.12 2
19 2008.12 2
20 2008.12 2
21 2008.12 2
22 2008.12 2
1) Convert the strings to zoo's "yearqtr" class and then to integers:
s <- c("2007.10", "2007.10", "2007.10", "2007.11", "2007.11", "2007.11",
"2007.11", "2007.12", "2008.01", "2008.01", "2008.01", "2008.01",
"2008.01", "2008.02", "2008.03", "2008.03", "2008.03", "2008.04",
"2008.04", "2008.04")
library(zoo)
yq <- as.yearqtr(s, "%Y.%m")
as.numeric(factor(yq))
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
The last line could alternately be: 4*(yq - yq[1])+1
Note that in the question 2007.12 is classified as in a different quarter than 2007.10 and 2007.11; however, they are all in the same quarter and we assume you did not intend this.
2) Another possibility depending on what you want is:
f <- factor(s)
nlev <- nlevels(f)
levels(f) <- gl(nlev, 3, nlev)
f
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
## Levels: 1 2 3
If there are missing months then this will give a different answer than (1), so it all depends on what you are looking for.
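As for why the function in the question starts at 4: for "2007.10" it computes (2007 - 2006) + (10 %/% 3) = 1 + 3. A base R sketch without zoo, assuming calendar quarters are what is wanted (quarter_number is a hypothetical helper name and the input vector is made up to include a gap):

```r
s <- c("2007.10", "2007.11", "2007.12", "2008.01", "2008.03", "2008.04", "2008.10")

# Hypothetical helper: absolute quarter number (year * 4 + quarter offset)
quarter_number <- function(s) {
  parts <- strsplit(s, ".", fixed = TRUE)
  yr <- as.integer(vapply(parts, `[`, character(1), 1))
  mo <- as.integer(vapply(parts, `[`, character(1), 2))
  yr * 4L + (mo - 1L) %/% 3L
}

q <- quarter_number(s)

# Calendar position relative to the first quarter (empty quarters leave gaps)
q - min(q) + 1L
# [1] 1 1 1 2 2 3 5

# Consecutive numbering of the quarters that actually occur (no gaps)
match(q, sort(unique(q)))
# [1] 1 1 1 2 2 3 4
```

The first variant behaves like 4*(yq - yq[1])+1 and the second like as.numeric(factor(yq)), so the choice again depends on how empty quarters should be treated.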

Conditional column-wise or row-wise subtraction in a data frame

I need to do a column-wise subtraction and row-wise subtraction in R.
id on fail
1 10-10-2014 11-11-2014
1 11-10-2014 12-12-2014
1 12-10-2014 12-01-2015
2 13-10-2014 12-02-2015
2 14-10-2014 15-03-2015
2 15-10-2014 15-04-2015
2 16-10-2014 16-05-2015
3 17-10-2014 16-06-2015
3 18-10-2014 17-07-2015
3 19-10-2014 17-08-2015
3 20-10-2014 17-09-2015
For example, in the above table whenever a new id appears it should do a column-wise subtraction, else it should do row-wise subtraction. I need to have a result like this:
id on fail res
1 10-10-2014 11-11-2014 32
1 11-10-2014 12-12-2014 31
1 12-10-2014 12-01-2015 31
2 13-10-2014 12-02-2015 122
2 14-10-2014 15-03-2015 31
2 15-10-2014 15-04-2015 31
2 16-10-2014 16-05-2015 31
3 17-10-2014 16-06-2015 242
3 18-10-2014 17-07-2015 31
3 19-10-2014 17-08-2015 31
3 20-10-2014 17-09-2015 31
As of now I am using the following code:
data[,2] <- as.Date(data[,2],format="%d-%m-%Y")
data[,3] <- as.Date(data[,3],format="%d-%m-%Y")
x <- as.numeric(diff(data[,3]))
DF <- read.table(text="id on fail
1 10-10-2014 11-11-2014
1 11-10-2014 12-12-2014
1 12-10-2014 12-01-2015
2 13-10-2014 12-02-2015
2 14-10-2014 15-03-2015
2 15-10-2014 15-04-2015
2 16-10-2014 16-05-2015
3 17-10-2014 16-06-2015
3 18-10-2014 17-07-2015
3 19-10-2014 17-08-2015
3 20-10-2014 17-09-2015 ", header=TRUE)
DF[,2:3] <- lapply(DF[,2:3], as.Date, format="%d-%m-%Y")
DF$res <- c(NA, diff(DF$fail))
DF[c(TRUE ,diff(DF$id)!=0), "res"] <- DF[c(TRUE ,diff(DF$id)!=0), "fail"] -
DF[c(TRUE ,diff(DF$id)!=0), "on"]
# id on fail res
# 1 1 2014-10-10 2014-11-11 32
# 2 1 2014-10-11 2014-12-12 31
# 3 1 2014-10-12 2015-01-12 31
# 4 2 2014-10-13 2015-02-12 122
# 5 2 2014-10-14 2015-03-15 31
# 6 2 2014-10-15 2015-04-15 31
# 7 2 2014-10-16 2015-05-16 31
# 8 3 2014-10-17 2015-06-16 242
# 9 3 2014-10-18 2015-07-17 31
# 10 3 2014-10-19 2015-08-17 31
# 11 3 2014-10-20 2015-09-17 31
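The same logic can be written with the "new id" mask computed once and a single ifelse. This is only a sketch on the first five rows of the data; new_id is a name introduced here, and it assumes a numeric id and rows already ordered as in the question:

```r
DF <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  on   = as.Date(c("10-10-2014", "11-10-2014", "12-10-2014",
                   "13-10-2014", "14-10-2014"), format = "%d-%m-%Y"),
  fail = as.Date(c("11-11-2014", "12-12-2014", "12-01-2015",
                   "12-02-2015", "15-03-2015"), format = "%d-%m-%Y")
)

# TRUE on the first row of each run of identical ids
new_id <- c(TRUE, diff(DF$id) != 0)

# fail - on where a new id starts, otherwise fail minus the previous fail
DF$res <- ifelse(new_id,
                 as.numeric(DF$fail - DF$on),
                 c(NA, as.numeric(diff(DF$fail))))
DF$res
# [1]  32  31  31 122  31
```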

R: grouped data table with proportions

I have copied my code below. I start with a list of 50 small integers, representing the number of televisions owned by 50 families. My objective is shown in the object 'tv.final' below. My effort seems very wordy and inefficient.
Question: is there a better way to start with a list of 50 integers and end with a grouped data table with proportions? (Just taking my first baby steps with R, sorry for such a stupid question, but inquiring minds want to know.)
tv.data <- read.table("Tb02-08.txt",header=TRUE)
str(tv.data)
# 'data.frame': 50 obs. of 1 variable:
# $ TVs: int 1 1 1 2 6 3 3 4 2 4 ...
tv.table <- table(tv.data)
tv.table
# tv.data
# 0 1 2 3 4 5 6
# 1 16 14 12 3 2 2
tv.prop <- prop.table(tv.table)*100
tv.prop
# tv.data
# 0 1 2 3 4 5 6
# 2 32 28 24 6 4 4
tvs <- rbind(tv.table,tv.prop)
tvs
# 0 1 2 3 4 5 6
# tv.table 1 16 14 12 3 2 2
# tv.prop 2 32 28 24 6 4 4
tv.final <- t(tvs)
tv.final
# tv.table tv.prop
# 0 1 2
# 1 16 32
# 2 14 28
# 3 12 24
# 4 3 6
# 5 2 4
# 6 2 4
You can treat the object returned by table() as any other vector/matrix:
tv.table <- table(tv.data)
round(100 * tv.table/sum(tv.table))
That will give you the proportions in rounded percentage points.
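If the goal is the two-column layout of tv.final, the counts and percentages can be bound side by side in one step. A sketch, assuming the data can be rebuilt from the counts printed in the question (the file Tb02-08.txt itself is not available here, so the tvs vector is a reconstruction):

```r
# Reconstructed from the printed table: 1 family with 0 TVs, 16 with 1, ...
tvs <- rep(0:6, times = c(1, 16, 14, 12, 3, 2, 2))

tv.table <- table(tvs)

# Counts and percentages as columns; row names are the number of TVs
tv.final <- cbind(count = tv.table, percent = 100 * prop.table(tv.table))
tv.final
```

This avoids the separate rbind and t steps, since cbind already puts each measure in its own column.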
