R - Group data but apply different functions to different columns

I'd like to group this data but apply different functions to some columns when grouping.
ID type isDesc isImage
1 1 1 0
1 1 0 1
1 1 0 1
4 2 0 1
4 2 1 0
6 1 1 0
6 1 0 1
6 1 0 0
I want to group by ID, columns isDesc and isImage can be summed, but I would like to get the value of type as it is. type will be the same through the whole dataset. The result should look like this:
ID type isDesc isImage
1 1 1 2
4 2 1 1
6 1 1 1
Currently I am using
library(plyr)
summarized = ddply(data, .(ID), numcolwise(sum))
but it simply sums up all the columns. You don't have to use ddply, but if you think it's good for the job I'd like to stick with it. The data.table library is also an option.

Using data.table:
require(data.table)
dt <- data.table(data, key="ID")
dt[, list(type=type[1], isDesc=sum(isDesc),
isImage=sum(isImage)), by=ID]
# ID type isDesc isImage
# 1: 1 1 1 2
# 2: 4 2 1 1
# 3: 6 1 1 1
Using plyr:
ddply(data , .(ID), summarise, type=type[1], isDesc=sum(isDesc), isImage=sum(isImage))
# ID type isDesc isImage
# 1 1 1 1 2
# 2 4 2 1 1
# 3 6 1 1 1
Edit: If you have many columns to sum and other columns from which you only want the first value, you can use data.table's .SDcols:
dt1 <- dt[, lapply(.SD, sum), by=ID, .SDcols=c(3,4)]
dt2 <- dt[, lapply(.SD, head, 1), by=ID, .SDcols=c(2)]
dt2[dt1]
# ID type isDesc isImage
# 1: 1 1 1 2
# 2: 4 2 1 1
# 3: 6 1 1 1
You can pass either column names or column numbers to .SDcols. For example, .SDcols=c("type") is also valid.
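For comparison, here is a minimal base R sketch of the same grouping. It relies on the question's statement that type is constant within each ID, so type can simply be used as a second grouping variable instead of being summarised. The data frame is rebuilt from the sample in the question, and the name res is only illustrative.

```r
# Sample data from the question
data <- data.frame(
  ID      = c(1, 1, 1, 4, 4, 6, 6, 6),
  type    = c(1, 1, 1, 2, 2, 1, 1, 1),
  isDesc  = c(1, 0, 0, 0, 1, 1, 0, 0),
  isImage = c(0, 1, 1, 1, 0, 0, 1, 0)
)

# Sum isDesc and isImage; because type never changes within an ID,
# grouping by ID + type carries it through unchanged (assumption from the question)
res <- aggregate(cbind(isDesc, isImage) ~ ID + type, data = data, FUN = sum)

# aggregate() sorts by the last grouping variable first, so reorder by ID
res <- res[order(res$ID), ]
rownames(res) <- NULL
res
```

If type could differ within an ID, this would produce one row per ID/type combination rather than one row per ID, so the type[1] approach above is the safer choice in that case.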


data.table list .N (count) removes row completely rather than imputing 0 count

Say I have this data.table:
df <- data.frame(ID = c("A","A","A","A","B","B","B","B"),
Flag = c(1,1,1,1,0,0,0,1))
df <- data.table(df)
df
ID Flag
1: A 1
2: A 1
3: A 1
4: A 1
5: B 0
6: B 0
7: B 0
8: B 1
and I wish to count the number of 0 and 1 flags for each ID, as such:
Summary <- df[, list(Count = .N), by = c("ID","Flag")]
this returns the following results:
Summary
ID Flag Count
1: A 1 4
2: B 0 3
3: B 1 1
So, since there are no 0's recorded against ID A, there is no row listing the combination of ID A and Flag 0 with a count of zero.
What would be the way to do this using data.table?
I.e., I want to achieve this result:
Summary
ID Flag Count
1 A 0 0
2 A 1 4
3 B 0 3
4 B 1 1
Thanks!
You could convert the column to a factor, then tabulate. Since we know we only want the levels 0 and 1, we can use 0:1 for the Flag column in the output without actually storing Flag as a factor, although this method is slower (see the second part).
df[, .(Flag = 0:1, Count = tabulate(factor(Flag, levels = 0:1))), by = ID]
# ID Flag Count
# 1: A 0 0
# 2: A 1 4
# 3: B 0 3
# 4: B 1 1
As thelatemail notes in the comments, a faster method would be to factor the whole column first, then tabulate based on ID.
df[, Flag := factor(Flag, levels = 0:1)]
df[, .(Flag = levels(Flag), Count = tabulate(Flag)), by = ID]
# ID Flag Count
# 1: A 0 0
# 2: A 1 4
# 3: B 0 3
# 4: B 1 1
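The same idea also works without data.table. The following is a base R sketch, not part of the answers above: table() keeps empty combinations as long as the factor levels are declared up front, and as.data.frame() turns the result into long format. The column name Count is set through the responseName argument.

```r
df <- data.frame(ID   = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 Flag = c(1, 1, 1, 1, 0, 0, 0, 1))

# Declaring levels 0:1 makes table() report a zero for (A, 0)
counts <- table(ID = df$ID, Flag = factor(df$Flag, levels = 0:1))

# Long format: one row per ID/Flag combination, including empty ones
Summary <- as.data.frame(counts, responseName = "Count")
Summary <- Summary[order(Summary$ID, Summary$Flag), ]
rownames(Summary) <- NULL
Summary
```

Note that ID and Flag come back as factors here, so convert them if you need character or numeric columns.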

Grouping and Counting instances?

Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following data frame
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
turns into this (note: y is the value being counted).
EDIT: To explain the transformation: x is the column I'm grouping by. For each value of x, I want to count how many times 0, 1 and 2 occur in each of the other columns. For example, the first row of the transformed data frame counts how often the value 0 (y) occurs among the rows where x = 1: once in column a, twice in column b and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but in this case that is not necessary because the default aggregation function is length. If you don't specify one, you will get a warning about that:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the data frame to a data.table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently, this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
Which allows you to use count to get the information regarding how often certain values occur in columns a to c. After that, you reformat the dataset to your required format using spread.
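The same reshape-then-count logic can be sketched in base R, assuming you don't mind the x/value pair ending up in the row names. stack() plays the role of melt/gather and table() does the counting; the names long and tab are only illustrative.

```r
df <- data.frame(x = c(1, 1, 1, 2),
                 a = c(0, 1, 2, 1),
                 b = c(0, 0, 2, 2),
                 c = c(0, 1, 1, 1))

# Long format: one row per (x, column, value); stack() returns 'values' and 'ind'
long <- data.frame(x = rep(df$x, times = 3), stack(df[c("a", "b", "c")]))

# Count each observed x/value pair per original column;
# pairs that never occur (such as x = 2, value = 0) get no row
tab <- as.data.frame.matrix(table(paste(long$x, long$values), long$ind))
tab
```

The row names are strings such as "1 0" (x = 1, value = 0), so split them back into two columns if you need x and value separately.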

R aggregate on large number of columns without specifying column names

I could not find an answer to my question using the search function here or on Google.
I have a data frame (500 columns wide, 200,000 rows long) with multiple rows per person. Each cell (except for the first column, which holds a person ID) contains a 0 or a 1. I am looking for a way to reduce this data frame to one row per person, taking the maximum of each column by person.
I know that I could use ddply or data.table, like below:
tt <-data.frame(person=c(1,1,1,2,2,2,3,3,3), col1=c(0,0,1,1,1,0,0,0,0),col2=c(1, 1, 0, 0, 0, 0, 1 ,0 ,1))
library(plyr)
ddply(tt, .(person), summarize, col1=max(col1), col2=max(col2))
person col1 col2
1 1 1
2 1 0
3 0 1
But I don't want to specify each of my column names because 1) I have 500 of them and 2) they might be different in a new data set.
Use the summarise_each function from dplyr (note that recent dplyr releases deprecate summarise_each() and funs() in favour of across()):
library(dplyr)
tt %>% group_by(person) %>% summarise_each(funs(max))
# person col1 col2
# 1 1 1 1
# 2 2 1 0
# 3 3 0 1
or just the base aggregate function
aggregate(.~person, tt, max)
# person col1 col2
# 1 1 1 1
# 2 2 1 0
# 3 3 0 1
Or use data.table.
library(data.table)
setDT(tt)[, lapply(.SD, max), person]
# person col1 col2
#1: 1 1 1
#2: 2 1 0
#3: 3 0 1
Here is another attempt using only sapply() and lapply():
t(sapply(unique(tt$person), function(x) lapply(tt[tt$person==x,], max)))
person col1 col2
[1,] 1 1 1
[2,] 2 1 0
[3,] 3 0 1

Splitting one Column to Multiple R and Giving logical value if true

I am trying to split one column in a data frame into multiple columns, using the values from the original column as the new column names. Each new column should then hold a 1 if its value occurred in the original column for that row, or a 0 if not. I realize this is not the best explanation, so here is an example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and would like to expand it to wide format with 1's and 0's (or T and F), something like:
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr and the separate function, and reshape2 and the cast function, but I seem to be getting hung up on producing the logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
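One caveat with the grepl step is that it matches substrings, so a label such as "AB" would also count as a hit for "A". The variant below is a sketch that compares whole labels with %in% instead; it assumes "/" is the only separator and that there are at least two distinct labels (with a single label, vapply would return a vector rather than a matrix). The name parts is illustrative.

```r
df <- data.frame(subject = 1:4,
                 Location = c("A", "A/B", "B/C/D", "A/B/C/D"),
                 stringsAsFactors = FALSE)

# Split each Location into its individual labels
parts <- strsplit(df$Location, "/", fixed = TRUE)

# All distinct labels, used as the new column names
u <- sort(unique(unlist(parts)))

# One row per subject, one column per label: 1 if the label is present
m <- t(vapply(parts, function(p) as.integer(u %in% p), integer(length(u))))
colnames(m) <- u

res <- cbind(df["subject"], m)
res
```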

Computing contingency tables in R language

I have a data set in R as follows
ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1
I need to get the output table for it as under
Id Variable1-1 Variable1-2 Variable2-1 Variable2-2
1 1 0 0 1
2 0 2 2 0
Note that only rows where Choice is 1 are counted (Choice is a binary variable; the other variables can take any integer value). The aim is to have as many columns for a variable as it has levels.
Is there a way I can do this in R?
You could use melt and dcast from the reshape2 package:
mydf<-read.table(text="ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1",header=TRUE)
library(reshape2)
First melt the data.frame, selecting only those rows where Choice == 1 and removing the Choice column:
mydfM <- melt(mydf[mydf$Choice %in% 1, -match("Choice", names(mydf))], id = "ID")
# EDIT above: As @TylerRinker points out, using which() could be avoided.
# I've replaced it with %in%
# ID variable value
# 1 1 Variable1 1
# 2 2 Variable1 2
# 3 2 Variable1 2
# 4 1 Variable2 2
# 5 2 Variable2 1
# 6 2 Variable2 1
Then cast the melted data.frame, using length as the aggregation function:
(mydfC <- dcast(mydfM, ID ~ variable + value, fun.aggregate = length))
# ID Variable1_1 Variable1_2 Variable2_1 Variable2_2
# 1 1 1 0 0 1
# 2 2 0 2 2 0
It took me a while to figure out what you were after but I got it (I think). I have done what you asked but it's convoluted at best. I think this will help others see what you're after and you'll get better answers now.
dat <- read.table(text="ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1", header=TRUE)
A <- split(dat$Choice, list(dat$Variable1, dat$ID))
B <- split(dat$Choice, list(dat$Variable2, dat$ID))
C <- list(A, B)
FUN <- function(x) sapply(x, function(y) sum(y))
FUN2 <- function(x){
len <- length(x)/2
rbind(x[1:len], x[(len+1):length(x)])
}
dat2 <- do.call('data.frame', lapply(lapply(C, FUN), FUN2))
colnames(dat2) <- c('Variable1-1', 'Variable1-2', 'Variable2-1',
'Variable2-2')
dat2
This ain't your grandmother's contingency table, that's for sure. There's probably a much better way to accomplish all of this, maybe with reshape.
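For what it's worth, a shorter base R route seems possible too. The sketch below filters on Choice == 1, builds one ID-by-level table per variable with table(), and binds the tables side by side. It only produces columns for levels that actually occur among the chosen rows, and it uses "_" rather than "-" in the column names. The names chosen, vars, tabs and res are illustrative.

```r
mydf <- read.table(text = "ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1", header = TRUE)

# Keep only the rows that should be counted
chosen <- mydf[mydf$Choice == 1, ]
vars <- c("Variable1", "Variable2")

# One ID x level count table per variable, with prefixed column names
tabs <- lapply(vars, function(v) {
  tab <- table(chosen$ID, chosen[[v]])
  colnames(tab) <- paste(v, colnames(tab), sep = "_")
  as.data.frame.matrix(tab)
})

# Both tables share the same IDs in the same order, so they can be bound directly
res <- data.frame(ID = as.integer(rownames(tabs[[1]])),
                  do.call(cbind, tabs),
                  row.names = NULL)
res
```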
