Generalized aggregate by row in R

I would like to aggregate by row. I know how to do this and have answered several questions here from others asking for help doing it. However, I want to generalize the aggregate formula and ideally not have the aggregated rows in a different order than they first appear in the original data set.
Here is an example set:
my.data <- read.table(text = '
0 0 0 1
0 0 0 1
2 2 2 2
2 2 2 2
0 4 0 0
0 4 0 0
2 2 0 0
2 2 0 0
2 2 0 0
2 2 0 0
', header = FALSE)
and my desired result:
desired.result <- read.table(text = '
0 0 0 1 2
2 2 2 2 2
0 4 0 0 2
2 2 0 0 4
', header = FALSE)
Here is one way to obtain the answer, although the rows end up out of their original order:
my.data[,(ncol(my.data)+1)] = 1
aggregate(V5 ~ V1 + V2 + V3 + V4, FUN = sum, data=my.data)
V1 V2 V3 V4 V5
1 2 2 0 0 4
2 0 4 0 0 2
3 0 0 0 1 2
4 2 2 2 2 2
Here is an unsuccessful attempt to generalize the aggregate formula:
with(my.data, aggregate(my.data[,ncol(my.data)], by = list(paste0('V', seq(1, ncol(my.data)-1))), FUN = sum))
The order of the result is less important than the generalization.
Thank you for any advice.

Since it turned out that the desired result is just the frequency count of each unique row, you could/should use table (as mentioned in the comments). table calls factor on its arguments, and factor, if "levels" is not specified, sorts the unique values of its input to define the levels (unique itself does not sort). So, for table to "see" your levels in the desired row order, you need to call table on an explicitly specified factor.
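A minimal sketch of that difference between factor's default (sorted) levels and levels taken in order of first appearance:

```r
x <- c("b", "a", "b")
levels(factor(x))             # default levels are sorted: "a" "b"
levels(factor(x, unique(x)))  # first-appearance order: "b" "a"
```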
tmp = do.call(paste, my.data)
as.data.frame(table(tmp))
# tmp Freq
#1 0 0 0 1 2
#2 0 4 0 0 2
#3 2 2 0 0 4
#4 2 2 2 2 2
res = table(factor(tmp, unique(tmp)))
as.data.frame(res)
# Var1 Freq
#1 0 0 0 1 2
#2 2 2 2 2 2
#3 0 4 0 0 2
#4 2 2 0 0 4
Instead of calling as.data.frame.table (where your rows have been concatenated into a single string) you could take advantage of unique.data.frame and use a call like:
data.frame(unique(my.data), unclass(res))
# V1 V2 V3 V4 unclass.res.
#1 0 0 0 1 2
#3 2 2 2 2 2
#5 0 4 0 0 2
#7 2 2 0 0 4

It might be useful to mention that the count function in the plyr package can also aggregate this quickly, though you still lose the original order of the rows.
> library(plyr)
> x <- count(my.data)
> x
## V1 V2 V3 V4 freq
## 1 0 0 0 1 2
## 2 0 4 0 0 2
## 3 2 2 0 0 4
## 4 2 2 2 2 2
To order the table as desired.result shows (borrowing a snippet from @alexis_laz),
> pst <- do.call(paste, my.data)
> x[order(x$freq, as.factor(unique(pst))), ]
## V1 V2 V3 V4 freq
## 1 0 0 0 1 2
## 4 2 2 2 2 2
## 2 0 4 0 0 2
## 3 2 2 0 0 4
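As a side note, plyr is now retired in favor of dplyr; a rough equivalent with dplyr::count (a sketch assuming dplyr >= 1.0, which likewise returns the groups in sorted rather than first-appearance order):

```r
library(dplyr)

my.data <- data.frame(V1 = c(0, 0, 2, 2), V2 = c(0, 0, 2, 2),
                      V3 = c(0, 0, 2, 2), V4 = c(1, 1, 2, 2))
# tally identical rows, grouping by every column at once
count(my.data, across(everything()), name = "freq")
```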

I like the posted answers, especially the answer by @alexis_laz since I tend to prefer base R. However, here is a general answer using aggregate. The order of the rows in the output differs from the order of their first appearance in the original data set, but at least the rows are tallied.
I borrowed the . in the aggregate formula from @alexis_laz's comment:
my.data <- read.table(text = '
0 0 0 1
0 0 0 1
2 2 2 2
2 2 2 2
0 4 0 0
0 4 0 0
2 2 0 0
2 2 0 0
2 2 0 0
2 2 0 0
', header = FALSE)
my.data
my.count = rep(1, nrow(my.data))
my.count
aggregate(my.count ~ ., FUN = sum, data=my.data)
V1 V2 V3 V4 my.count
1 2 2 0 0 4
2 0 4 0 0 2
3 0 0 0 1 2
4 2 2 2 2 2
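If the first-appearance order does matter after all, one option (my own addition, not part of the original answer) is to reorder the aggregate result by matching each aggregated row against the unique rows of the input:

```r
my.data <- read.table(text = '
0 0 0 1
0 0 0 1
2 2 2 2
2 2 2 2
0 4 0 0
0 4 0 0
2 2 0 0
2 2 0 0
2 2 0 0
2 2 0 0
', header = FALSE)

agg <- aggregate(my.count ~ ., FUN = sum, data = data.frame(my.data, my.count = 1))
# paste each row into a single key string and match against first appearance
key <- do.call(paste, my.data)
agg[order(match(do.call(paste, agg[, -ncol(agg)]), unique(key))), ]
```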

Related

How do you sum different columns of binary variables based on a desired set of variables/columns?

I used the code below for a total of 25 variables and it worked. Each shows up as either 1 or 0:
jb$finances <- ifelse(grepl("Finances", jb$content.roll),1,0)
I want to add up the number of 1s in each row across a selection of the columns/variables I just made (using the code above), storing the result in another column called sum.content. I used the code below:
jb <- jb %>%
  mutate(sum.content = sum(jb$Finances, jb$Exercise, jb$Volunteer, jb$Relationships, jb$Laugh, jb$Gratitude, jb$Regrets, jb$Meditate, jb$Clutter))
I didn't get an error with the code above, but I did not get the outcome I wanted either: the result was 14 for every row, and I was expecting something < 9 since I only selected 9 variables. I don't want to delete the other variables like V1 and V2; I just want to sum over the selected ones.
This is what I got using the code:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 14
2 0 1 0 0 1 14
2 0 0 0 0 1 14
This is What I want:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 4
2 0 1 0 0 1 1
2 0 0 0 0 1 1
I want R to add up the number of 1s in each row, within the columns I select. How would I incorporate this row-wise sum in code, for a given set of variables/columns?
Here is an answer that uses dplyr to sum across rows over the variables whose names start with the letter V. We'll simulate some data, recode it to binary, and then sum the rows.
library(dplyr)

data <- matrix(rnorm(100, 100, 30), nrow = 10)
# recode to binary
data <- apply(data, 2, function(x) ifelse(x > 100, 1, 0))
# change some of the column names to illustrate the impact of
# select() within mutate()
colnames(data) <- c(paste0("V", 1:5), paste0("X", 1:5))
as.data.frame(data) %>%
  mutate(total = select(., starts_with("V")) %>% rowSums())
...and the output, where the totals should equal the sum of V1-V5 but not X1-X5:
V1 V2 V3 V4 V5 X1 X2 X3 X4 X5 total
1 1 0 0 0 1 0 0 0 1 0 2
2 1 0 0 1 0 0 0 1 1 0 2
3 1 1 1 0 1 0 0 0 1 0 4
4 0 0 1 1 0 1 0 0 1 0 2
5 0 0 1 0 1 0 1 1 1 0 2
6 0 1 1 0 1 0 0 1 1 1 3
7 1 0 1 1 0 0 0 0 0 1 3
8 1 0 0 1 1 1 0 1 1 1 3
9 1 1 0 0 1 0 1 1 0 0 3
10 0 1 1 0 1 1 0 0 1 0 3
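In current dplyr (>= 1.0), across() is the preferred replacement for select() inside mutate(); a minimal sketch of the same row sum over the V columns:

```r
library(dplyr)

df <- data.frame(V1 = c(1, 0), V2 = c(1, 1), X1 = c(1, 1))
# sum only the columns whose names start with "V", row by row
df %>% mutate(total = rowSums(across(starts_with("V"))))
```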

Converting data to longitudinal data

Hi, I am having difficulties trying to convert my data into longitudinal data using the reshape2 package. I would be grateful if anyone could help me, thank you!
Data is as follows:
m <- matrix(sample(c(0, 0:3), 100, replace = TRUE), 10)
ID<-c(1:10)
dim(ID)=c(10,1)
m<- cbind(ID,m)
d <- as.data.frame(m)
names(d)<-c('ID', 'litter1', 'litter2', 'litter3', 'litter4', 'litter5', 'litter6', 'litter7', 'litter8', 'litter9', 'litter10')
print(d)
ID litter1 litter2 litter3 litter4 litter5 litter6 litter7 litter8 litter9 litter10
1 0 0 0 3 1 0 2 0 0 3
2 0 2 1 2 0 0 0 2 0 0
3 1 0 1 2 0 3 3 3 2 0
4 2 1 2 3 0 2 3 3 1 0
5 0 1 2 0 0 0 3 3 1 0
6 2 1 2 0 3 3 0 0 0 0
7 0 1 0 3 0 0 1 2 2 0
8 0 1 3 3 2 1 3 2 3 0
9 0 2 0 2 2 3 2 0 0 3
10 2 2 2 2 1 3 0 3 0 0
I wish to convert the above data into longitudinal data with the columns ID, littercategory (which litter category, i.e. 1-10) and litternumber (the number of pieces for that litter category):
ID littercategory litternumber
1 4 3
1 5 1
1 7 2
1 10 3
2 2 2
2 3 1
2 4 2
2 8 2
and so on.
Would really appreciate your help thank you!
You could do that as follows:
library(reshape2)
d <- melt(d, id.vars = "ID")
colnames(d) <- c('ID', 'littercategory', 'litternumber')
# remove the text in the littercategory column, keep only the number
d$littercategory <- gsub('litter', '', d$littercategory)
d <- d[d$litternumber != 0, ]
Output:
ID littercategory litternumber
1 1 4
2 1 8
3 1 6
4 1 4
7 1 6
8 1 5
10 1 10
1 2 6
2 2 9
As you can see, only the ordering differs from the output you requested, but I'm sure you can fix that yourself. (If not, there are plenty of resources on how to do that.)
Hope this helps!
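For reference, the same melt-and-filter steps can be sketched in base R alone (my own sketch with small made-up data), sorting by ID at the end to fix the ordering:

```r
d <- data.frame(ID = 1:2, litter1 = c(0, 2), litter2 = c(3, 1))

# stack the litter columns: one row per (ID, category) pair
long <- data.frame(ID = rep(d$ID, times = ncol(d) - 1),
                   littercategory = rep(seq_len(ncol(d) - 1), each = nrow(d)),
                   litternumber = unlist(d[-1], use.names = FALSE))
long <- long[long$litternumber != 0, ]  # drop empty categories
long <- long[order(long$ID), ]          # all rows for ID 1 first, then ID 2, ...
```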
To get the desired output you have to melt your data and keep only the values larger than 0.
library(data.table)
result <- setDT(melt(d, "ID"))[value != 0][order(ID)]
# To get exact structure modify result
result[, .(ID,
           littercategory = sub("litter", "", variable),
           litternumber = value)]

Selecting rows of a dataframe according to the correspondence of two covariates' levels

I am currently working on two different dataframes, one of which is extremely long (long). What I need to do is to select all the rows of long whose corresponding id_type appears at least once in the other (smaller) dataset.
Suppose the two dataframes are:
long <- read.table(text = "
id_type x1 x2
1 0 0
1 0 1
1 1 0
1 1 1
2 0 0
2 0 1
2 1 0
2 1 1
3 0 0
3 0 1
3 1 0
3 1 1
4 0 0
4 0 1
4 1 0
4 1 1",
header=TRUE)
and
short <- read.table(text = "
id_type y1 y2
1 5 6
1 5 5
2 7 9",
header=TRUE)
In practice, what I am trying to obtain is:
id_type x1 x2
1 0 0
1 0 1
1 1 0
1 1 1
2 0 0
2 0 1
2 1 0
2 1 1
I have tried to use out <- long[long[,"id_type"]==short[,"id_type"], ], but it is clearly wrong. How would you proceed? Thanks
Just use %in%:
out <- long[long$id_type %in% short$id_type, ]
Look at ?"%in%".
You were missing %in%:
> long[long$id_type %in% unique(short$id_type),]
id_type x1 x2
1 1 0 0
2 1 0 1
3 1 1 0
4 1 1 1
5 2 0 0
6 2 0 1
7 2 1 0
8 2 1 1
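The dplyr counterpart is a semi join, which keeps the rows of long that have a match in short without duplicating them even when short repeats an id_type (a sketch with small stand-in data, assuming dplyr is available):

```r
library(dplyr)

long <- data.frame(id_type = rep(1:4, each = 2), x1 = c(0, 1))
short <- data.frame(id_type = c(1, 1, 2), y1 = c(5, 5, 7))
semi_join(long, short, by = "id_type")  # rows of long with id_type 1 or 2
```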

randomly sum values from rows and assign them to 2 columns in R

I have a data.frame with 8 columns: one holds the list of subjects (one row per subject) and the other 7 columns each hold a score of either 1 or 0.
This is what the data looks like:
>head(splitkscores)
subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1
I want to create a data.frame with 3 columns. One column for subjects. In the other two columns, one must have the sum of 3 or 4 randomly chosen numbers from each row of my data.frame (except the subject) and the other column must have the sum of the remaining values which were not chosen in the first random sample.
Help is much appreciated.
Thanks in advance
Here's a neat and tidy solution free of unnecessary complexity (assume the input is called df):
chosen <- sort(sample(setdiff(colnames(df), "subject"), sample(c(3, 4), 1)))
notchosen <- setdiff(colnames(df), c("subject", chosen))
out <- data.frame(subject = df$subject,
                  sum1 = apply(df[, chosen], 1, sum),
                  sum2 = apply(df[, notchosen], 1, sum))
In plain English: sample from the column names other than "subject", choosing a sample size of either 3 or 4, and call those column names chosen; define notchosen to be the other columns (excluding "subject" again, obviously); then return a data frame with the list of subjects, the sum of the chosen columns, and the sum of the non-chosen columns. Done.
I think this'll do it (I changed the way the data are read in, based on the other response, because I had made a manual mistake):
splitkscores <- read.table(text = " subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1", header = TRUE)
df2 <- data.frame(subject = splitkscores$subject, sum3or4 = NA, leftover = NA)
df2$sum3or4 <- apply(splitkscores[, 2:ncol(splitkscores)], 1, function(x) {
  sum(sample(x, sample(c(3, 4), 1), replace = FALSE))
})
df2$leftover <- rowSums(splitkscores[,2:ncol(splitkscores)]) - df2$sum3or4
df2
subject sum3or4 leftover
1 40002 1 0
2 40002 2 1
3 40002 3 4
4 40002 1 2
5 40002 2 1
6 40002 1 4

Creating dummy variables (n-1) categories

I found similar entries but not exactly what I want. For a two-category variable (e.g., gender coded 1/2), I need to create a dummy variable, with 0s being male and 1s being female.
Here is what my data look like and what I did.
data <- as.data.frame(matrix(c(1, 2, 2, 1, 2, 1, 1, 2), 8, 1))
V1
1 1
2 2
3 2
4 1
5 2
6 1
7 1
8 2
library(dummies)
data <- cbind(data, dummy(data$V1, sep = "_"))
> data
V1 data_1 data_2
1 1 1 0
2 2 0 1
3 2 0 1
4 1 1 0
5 2 0 1
6 1 1 0
7 1 1 0
8 2 0 1
In this code, the second category is also coded (0,1). Also, is there a way to choose the baseline, i.e. which category gets assigned 0?
I want it to look like this:
> data
V1 V1_dummy
1 1 0
2 2 1
3 2 1
4 1 0
5 2 1
6 1 0
7 1 0
8 2 1
Also, I want to extend this to three-category variables, keeping n-1 (i.e., two) dummy categories after recoding.
Thanks in advance!
You can use model.matrix in the following way. Some sample data with a three-level factor:
set.seed(1)
(df <- data.frame(x = factor(rbinom(5, 2, 0.4))))
# x
# 1 0
# 2 1
# 3 1
# 4 2
# 5 0
Then
model.matrix(~ x, df)[, -1]
# x1 x2
# 1 0 0
# 2 1 0
# 3 1 0
# 4 0 1
# 5 0 0
If you want to specify which group disappears, you need to rearrange the factor levels, since it is the first level that disappears. One quick (if hacky) way is to relabel the levels, e.g.,
levels(df$x) <- c("1", "0", "2")
model.matrix(~x, df)[, -1]
# x0 x2
# 1 0 0
# 2 1 0
# 3 1 0
# 4 0 1
# 5 0 0
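A cleaner way to pick the baseline without relabeling the groups is relevel (or factor with explicit levels), which moves the chosen category into the first position so that model.matrix drops it; a sketch using the same values as the simulated data above:

```r
df <- data.frame(x = factor(c(0, 1, 1, 2, 0)))
df$x <- relevel(df$x, ref = "1")  # category "1" becomes the baseline
model.matrix(~ x, df)[, -1]       # dummy columns for "0" and "2" remain
```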
