Creating dummy variables with (n-1) categories in R

I found similar entries but none that do exactly what I want. For a two-category variable (e.g., gender coded 1/2), I need to create a dummy variable with 0 for male and 1 for female.
Here is what my data look like and what I did:
data <- as.data.frame(matrix(c(1, 2, 2, 1, 2, 1, 1, 2), 8, 1))
V1
1 1
2 2
3 2
4 1
5 2
6 1
7 1
8 2
library(dummies)
data <- cbind(data, dummy(data$V1, sep = "_"))
> data
V1 data_1 data_2
1 1 1 0
2 2 0 1
3 2 0 1
4 1 1 0
5 2 0 1
6 1 1 0
7 1 1 0
8 2 0 1
In this code, a (0,1) column is created for the second category as well. Also, is there a way to choose the baseline (i.e. which category is assigned 0)?
I want it to look like this:
> data
V1 V1_dummy
1 1 0
2 2 1
3 2 1
4 1 0
5 2 1
6 1 0
7 1 0
8 2 1
Also, I want to extend this to three-category variables, keeping two dummy columns after recoding (n-1).
Thanks in advance!

You can use model.matrix in the following way. First, some sample data with a three-level factor:
set.seed(1)
(df <- data.frame(x = factor(rbinom(5, 2, 0.4))))
# x
# 1 0
# 2 1
# 3 1
# 4 2
# 5 0
Then
model.matrix(~ x, df)[, -1]
# x1 x2
# 1 0 0
# 2 1 0
# 3 1 0
# 4 0 1
# 5 0 0
If you want to specify which group disappears (i.e. which level serves as the baseline), we need to rearrange the factor levels, since it is always the first level that disappears. Note that assigning to levels(df$x) would relabel the values rather than reorder the levels, so use relevel (or factor with an explicit levels argument) instead. For example, to make "1" the baseline:
df$x <- relevel(df$x, ref = "1")
model.matrix(~ x, df)[, -1]
#   x0 x2
# 1  1  0
# 2  0  0
# 3  0  0
# 4  0  1
# 5  1  0
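Applied to the two-category gender example from the question, a minimal sketch (my addition, reusing the data frame built there): dropping the intercept column leaves a single 0/1 dummy, with the first level (1 = male) as the baseline.
data$V1_dummy <- model.matrix(~ factor(V1), data)[, -1]
data$V1_dummy
# [1] 0 1 1 0 1 0 0 1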

Is there a R function for preparing datasets for survival analysis like stset in Stata?

Datasets look like this
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
As you can see, for id = 1 the rows are already in the form expected by coxph in the survival package. For id = 2, however, failure occurs at the beginning and at the end, but disappears in the middle.
Is there a general function to extract the data for id = 2 and get a result in the same form as id = 1?
I think that for id = 2, the result should look like the following.
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
A bit hacky, but should get the job done.
Data:
# Load data
library(tidyverse)
df <- read_table("
id start end failure x1
1 0 1 0 0
1 1 3 0 0
1 3 6 1 0
2 0 1 1 1
2 1 3 1 1
2 3 4 0 1
2 4 6 0 1
2 6 7 1 1
")
Data wrangling:
# Check for sub-groups within IDs and remove all but the last one
df <- df %>%
  # Group by ID
  group_by(id) %>%
  mutate(
    # Check if a new sub-group is starting (after a failure)
    new_group = case_when(
      # First row is always group 0
      row_number() == 1 ~ 0,
      # If the previous row was a failure, a new sub-group starts here
      lag(failure) == 1 ~ 1,
      # Otherwise not
      TRUE ~ 0
    ),
    # Assign sub-group numbers by taking the cumulative sum
    group = cumsum(new_group)
  ) %>%
  # Keep only the last sub-group for each ID
  filter(group == max(group)) %>%
  ungroup() %>%
  # Remove the working columns
  select(-new_group, -group)
Result:
> df
# A tibble: 6 × 5
id start end failure x1
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0
2 1 1 3 0 0
3 1 3 6 1 0
4 2 3 4 0 1
5 2 4 6 0 1
6 2 6 7 1 1
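For comparison, here is a base R sketch of the same logic (my addition, not part of the original answer): a lagged cumulative sum of failure labels the sub-groups within each id, and only the last sub-group is kept.
# Label sub-groups: a new one starts on the row after each failure
df$grp <- ave(df$failure, df$id,
              FUN = function(f) cumsum(c(0, head(f, -1))))
# Keep only the last sub-group within each id, then drop the helper
df <- do.call(rbind, lapply(split(df, df$id),
                            function(x) x[x$grp == max(x$grp), ]))
df$grp <- NULL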

Converting data to longitudinal data

Hi, I am having difficulty converting my data into longitudinal (long-format) data using the reshape package. I would be grateful if anyone could help me, thank you!
Data is as follows:
m <- matrix(sample(c(0, 0:3), 100, replace = TRUE), 10)
ID<-c(1:10)
dim(ID)=c(10,1)
m<- cbind(ID,m)
d <- as.data.frame(m)
names(d)<-c('ID', 'litter1', 'litter2', 'litter3', 'litter4', 'litter5', 'litter6', 'litter7', 'litter8', 'litter9', 'litter10')
print(d)
ID litter1 litter2 litter3 litter4 litter5 litter6 litter7 litter8 litter9 litter10
1 0 0 0 3 1 0 2 0 0 3
2 0 2 1 2 0 0 0 2 0 0
3 1 0 1 2 0 3 3 3 2 0
4 2 1 2 3 0 2 3 3 1 0
5 0 1 2 0 0 0 3 3 1 0
6 2 1 2 0 3 3 0 0 0 0
7 0 1 0 3 0 0 1 2 2 0
8 0 1 3 3 2 1 3 2 3 0
9 0 2 0 2 2 3 2 0 0 3
10 2 2 2 2 1 3 0 3 0 0
I wish to convert the above data into long format with the columns 'ID', 'littercategory' (the category of the litter, i.e. 1-10), and 'litternumber' (the number of pieces for that litter category):
ID littercategory litternumber
1 4 3
1 5 1
1 7 2
1 10 3
2 2 2
2 3 1
2 4 2
2 8 2
and so on.
Would really appreciate your help thank you!
You could do that as follows:
library(reshape2)
d = melt(d, id.vars=c("ID"))
colnames(d) = c('ID','littercategory','litternumber')
# remove the text in the littercategory column, keep only the number.
d$littercategory = gsub('litter','',d$littercategory)
d = d[d$litternumber != 0, ]
Output:
ID littercategory litternumber
1 1 4
2 1 8
3 1 6
4 1 4
7 1 6
8 1 5
10 1 10
1 2 6
2 2 9
As you can see, only the ordering differs from the output you requested, but that is easy to fix yourself; a sketch follows below.
Hope this helps!
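For completeness, a one-line sketch of that reordering (my addition): sort by ID and then by the numeric litter category.
d <- d[order(d$ID, as.numeric(d$littercategory)), ]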
To get the desired output you have to melt your data and keep only the values greater than 0.
library(data.table)
result <- setDT(melt(d, "ID"))[value != 0][order(ID)]
# To get exact structure modify result
result[, .(ID,
           littercategory = sub("litter", "", variable),
           litternumber = value)]
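If you prefer the tidyverse, here is a pivot_longer sketch along the same lines (my addition, assuming the litter1 to litter10 column names shown above):
library(tidyr)
library(dplyr)
d %>%
  # One row per (ID, litter column); strip the "litter" prefix
  pivot_longer(-ID, names_to = "littercategory", names_prefix = "litter",
               values_to = "litternumber") %>%
  # Drop the empty litter categories
  filter(litternumber > 0)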

Sum rows in a group, starting when a specific value occurs

I want to accumulate the values of one column until the end of the group, but starting the addition only when a specific value occurs in another column. I am only interested in the first instance of that value within a group, so if it occurs again the accumulating column should simply keep adding the values. I know this sounds like a rather strange problem, so hopefully the example table makes sense.
The following data frame is what I have now:
> df = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))
> df
group numToAdd occurs
1 1 1 0
2 1 1 0
3 1 3 1
4 1 2 0
5 2 4 0
6 2 2 1
7 2 1 0
8 2 3 0
9 2 2 0
10 3 1 0
11 3 2 1
12 3 1 1
13 4 2 0
14 4 3 0
15 4 2 0
Thus, whenever a 1 occurs within a group, I want a cumulative sum of the values from the column numToAdd, until a new group starts. This would look like the following:
> finalDF = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0),added = c(0,0,3,5,0,2,3,6,8,0,2,3,0,0,0))
> finalDF
group numToAdd occurs added
1 1 1 0 0
2 1 1 0 0
3 1 3 1 3
4 1 2 0 5
5 2 4 0 0
6 2 2 1 2
7 2 1 0 3
8 2 3 0 6
9 2 2 0 8
10 3 1 0 0
11 3 2 1 2
12 3 1 1 3
13 4 2 0 0
14 4 3 0 0
15 4 2 0 0
Thus, the added column stays 0 until a 1 occurs within the group, then accumulates the values from numToAdd until a new group starts, at which point added resets to 0. In group 3, a value of 1 is found a second time, yet the cumulative sum simply continues. In group 4, a value of 1 never occurs, so the added column remains 0 throughout.
I've played around with dplyr but can't get it to work. The following attempt only outputs the total sum, not the running cumulative sum at each row.
library(dplyr)
df = df %>%
  mutate(added = ifelse(occurs == 1, cumsum(numToAdd), 0)) %>%
  group_by(group)
Try the following. cummax(occurs) switches from 0 to 1 at the first occurrence within the group and stays at 1, so the multiplication zeroes out numToAdd before that point:
df %>%
  group_by(group) %>%
  mutate(added = cumsum(numToAdd * cummax(occurs)))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Or using data.table
library(data.table) # v1.9.5+
i1 <- setDT(df)[, .I[(rleid(occurs) + (occurs > 0)) > 1], group]$V1
df[, added := 0][i1, added := cumsum(numToAdd), by = group]
Or a similar option as in dplyr
setDT(df)[,added := cumsum(numToAdd * cummax(occurs)) , by = group]
You can use split-apply-combine in base R with something like:
df$added <- unlist(lapply(split(df, df$group), function(x) {
  y <- rep(0, nrow(x))
  # TRUE from the first occurrence within the group onwards
  pos <- cumsum(x$occurs) > 0
  y[pos] <- cumsum(x$numToAdd[pos])
  y
}))
df
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
To add another base R approach:
df$added <- unlist(lapply(split(df, df$group), function(x) {
  # Before the first occurrence cumsum(occurs) is 0, so those positions
  # contribute the leading zeros; from the first occurrence on, take the
  # cumulative sum of numToAdd
  c(x[, 'occurs'][cumsum(x[, 'occurs']) == 0L],
    cumsum(x[, 'numToAdd'][cumsum(x[, 'occurs']) != 0L]))
}))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
Another base R:
df$added <- unlist(lapply(split(df, df$group), function(x) {
  # cumsum(occurs) > 0 acts as an on/off switch for the running sum
  cumsum((cumsum(x$occurs) > 0) * x$numToAdd)
}))

Generalized aggregate by row

I would like to aggregate by row. I know how to do this and have answered several questions here from others asking for help doing it. However, I want to generalize the aggregate formula, and ideally keep the aggregated rows in the order in which they first appear in the original data set.
Here is an example set:
my.data <- read.table(text = '
0 0 0 1
0 0 0 1
2 2 2 2
2 2 2 2
0 4 0 0
0 4 0 0
2 2 0 0
2 2 0 0
2 2 0 0
2 2 0 0
', header = FALSE)
and my desired result:
desired.result <- read.table(text = '
0 0 0 1 2
2 2 2 2 2
0 4 0 0 2
2 2 0 0 4
', header = FALSE)
Here is one way to obtain the answer, although the rows are not in their original order:
my.data[,(ncol(my.data)+1)] = 1
aggregate(V5 ~ V1 + V2 + V3 + V4, FUN = sum, data=my.data)
V1 V2 V3 V4 V5
1 2 2 0 0 4
2 0 4 0 0 2
3 0 0 0 1 2
4 2 2 2 2 2
Here is an unsuccessful attempt to generalize the aggregate formula:
with(my.data, aggregate(my.data[,ncol(my.data)], by = list(paste0('V', seq(1, ncol(my.data)-1))), FUN = sum))
The order of the result is less important than the generalization.
Thank you for any advice.
Since it turned out that the desired result is just the frequency count of each unique row, you could/should use table (as mentioned in the comments). table applies factor to its arguments, and factor, if "levels" is not specified, sorts the unique values of its input to define the levels (unique itself does not sort). So, for table to respect the desired order of rows, you need to call it on a factor with explicitly specified levels.
tmp = do.call(paste, my.data)
as.data.frame(table(tmp))
# tmp Freq
#1 0 0 0 1 2
#2 0 4 0 0 2
#3 2 2 0 0 4
#4 2 2 2 2 2
res = table(factor(tmp, unique(tmp)))
as.data.frame(res)
# Var1 Freq
#1 0 0 0 1 2
#2 2 2 2 2 2
#3 0 4 0 0 2
#4 2 2 0 0 4
Instead of calling as.data.frame.table (where your rows have been concatenated into single strings), you could take advantage of unique.data.frame and use a call like:
data.frame(unique(my.data), unclass(res))
# V1 V2 V3 V4 unclass.res.
#1 0 0 0 1 2
#3 2 2 2 2 2
#5 0 4 0 0 2
#7 2 2 0 0 4
It might be useful to mention that the count function in the plyr package can also aggregate this quickly, although you would still lose the original order of the rows.
> library(plyr)
> x <- count(my.data)
> x
## V1 V2 V3 V4 freq
## 1 0 0 0 1 2
## 2 0 4 0 0 2
## 3 2 2 0 0 4
## 4 2 2 2 2 2
To order the table as desired.result shows (and borrowing a snippet from #alexis_laz),
> pst <- do.call(paste, my.data)
> x[order(x$freq, as.factor(unique(pst))), ]
## V1 V2 V3 V4 freq
## 1 0 0 0 1 2
## 4 2 2 2 2 2
## 2 0 4 0 0 2
## 3 2 2 0 0 4
I like the posted answers, especially the answer by #alexis_laz since I tend to prefer base R. However, here is a general answer using aggregate. The order of the rows in the output differs from the order of their first appearance in the original data set, but at least the rows are tallied (a sketch for restoring the original order follows the output):
I borrowed the . in aggregate from #alexis_laz's comment:
my.data <- read.table(text = '
0 0 0 1
0 0 0 1
2 2 2 2
2 2 2 2
0 4 0 0
0 4 0 0
2 2 0 0
2 2 0 0
2 2 0 0
2 2 0 0
', header = FALSE)
my.data
my.count = rep(1, nrow(my.data))
my.count
aggregate(my.count ~ ., FUN = sum, data=my.data)
V1 V2 V3 V4 my.count
1 2 2 0 0 4
2 0 4 0 0 2
3 0 0 0 1 2
4 2 2 2 2 2
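If the original row order matters, one possible follow-up (my addition, not part of the original answer) is to reorder the aggregate output by matching each unique row, pasted into a single string, against its first appearance in my.data:
out <- aggregate(my.count ~ ., FUN = sum, data = my.data)
# Unique rows in order of first appearance, keyed as pasted strings
first_seen <- unique(do.call(paste, my.data))
out <- out[match(first_seen, do.call(paste, out[names(my.data)])), ]
out
#   V1 V2 V3 V4 my.count
# 3  0  0  0  1        2
# 4  2  2  2  2        2
# 2  0  4  0  0        2
# 1  2  2  0  0        4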

Selecting rows of a dataframe according to the correspondence of two covariates' levels

I am currently working on two different dataframes, one of which is extremely long (long). What I need to do is to select all the rows of long whose corresponding id_type appears at least once in the other (smaller) dataset.
Suppose the two dataframes are:
long <- read.table(text = "
id_type x1 x2
1 0 0
1 0 1
1 1 0
1 1 1
2 0 0
2 0 1
2 1 0
2 1 1
3 0 0
3 0 1
3 1 0
3 1 1
4 0 0
4 0 1
4 1 0
4 1 1",
header=TRUE)
and
short <- read.table(text = "
id_type y1 y2
1 5 6
1 5 5
2 7 9",
header=TRUE)
In practice, what I am trying to obtain is:
id_type x1 x2
1 0 0
1 0 1
1 1 0
1 1 1
2 0 0
2 0 1
2 1 0
2 1 1
I have tried to use out <- long[long[,"id_type"]==short[,"id_type"], ], but it is clearly wrong. How would you proceed? Thanks
Just use %in%:
out <- long[long$id_type %in% short$id_type, ]
Look at ?"%in%".
You were missing %in%:
> long[long$id_type %in% unique(short$id_type),]
id_type x1 x2
1 1 0 0
2 1 0 1
3 1 1 0
4 1 1 1
5 2 0 0
6 2 0 1
7 2 1 0
8 2 1 1
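A dplyr alternative (my addition) expresses the same filtering as a semi-join, keeping the rows of long that have at least one match in short:
library(dplyr)
# Rows of 'long' whose id_type appears at least once in 'short'
out <- semi_join(long, short, by = "id_type")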
