I could not find an answer to my question using the search function here nor on Google.
I have a data frame (500 columns wide, 200,000 rows long) with multiple rows per person. Each cell (except for the first column, which holds a person ID) contains a 0 or a 1. I am looking for a way to reduce this data frame to one row per person, taking the maximum of each column within each person.
I know that I could use ddply or data.table, like below...
tt <- data.frame(person = c(1,1,1,2,2,2,3,3,3), col1 = c(0,0,1,1,1,0,0,0,0), col2 = c(1,1,0,0,0,0,1,0,1))
library(plyr)
ddply(tt, .(person), summarize, col1=max(col1), col2=max(col2))
  person col1 col2
1      1    1    1
2      2    1    0
3      3    0    1
But I don't want to have to specify each of my column names, because 1) I have 500 of them and 2) they might be different on a new data set.
Use the summarise_each function from dplyr
library(dplyr)
tt %>% group_by(person) %>% summarise_each(funs(max))
# person col1 col2
# 1 1 1 1
# 2 2 1 0
# 3 3 0 1
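Note that summarise_each() has since been superseded in newer dplyr releases. Assuming dplyr 1.0 or later, an equivalent sketch with across() would be:
library(dplyr)
# max over every non-grouping column; across(everything(), ...) skips the grouping column
tt %>%
  group_by(person) %>%
  summarise(across(everything(), max))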
or just the base aggregate function
aggregate(.~person, tt, max)
# person col1 col2
# 1 1 1 1
# 2 2 1 0
# 3 3 0 1
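One caveat, in case your full data contains missing values (an assumption, not stated in the question): the formula interface of aggregate drops rows with NAs by default, so you may want something like
# na.action = na.pass keeps rows containing NAs; na.rm = TRUE is forwarded to max()
aggregate(. ~ person, tt, max, na.rm = TRUE, na.action = na.pass)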
Or use data.table.
library(data.table)
setDT(tt)[, lapply(.SD, max), person]
# person col1 col2
#1: 1 1 1
#2: 2 1 0
#3: 3 0 1
Another attempt using only sapply() and lapply():
t(sapply(unique(tt$person), function(x) lapply(tt[tt$person==x,], max)))
person col1 col2
[1,] 1 1 1
[2,] 2 1 0
[3,] 3 0 1
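If you prefer to avoid the list-matrix result above, one possible base-R variant (a sketch, assuming the person ID is the first column as in the question) is to split the remaining columns by person and take the column maxima of each piece:
res <- t(sapply(split(as.data.frame(tt)[-1], tt$person), function(d) sapply(d, max)))  # one row per person
res <- data.frame(person = as.numeric(rownames(res)), res)                             # reattach the person IDs
res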
Related
I have a data frame where each individual is represented by two columns: columns 1 and 2 correspond to individual 1, columns 3 and 4 to individual 2, and so on.
Basically, what I want to do is add each pair of contiguous columns so that I get each individual's real score.
In this example V1 and V2 represent individual I and V3 and V4 represent individual II. The resulting data frame will have half the number of columns, the same number of rows, and each value will be the sum of the two contiguous columns.
Data
V1 V2 V3 V4
1 0 0 1 1
2 1 0 0 0
3 0 1 1 1
4 0 1 0 1
Desired output
I II
1 0 2
2 1 0
3 1 2
4 1 1
I tried something like this
a <- data.frame(V1 = c(0,1,0,0), V2 = c(0,0,1,1), V3 = c(1,0,1,0), V4 = c(1,0,1,1))
b <- matrix(NA, nrow = nrow(a), ncol = ncol(a))  # container for the pairwise sums
for (i in seq(2, ncol(a), by = 2)) {
  for (k in 1:nrow(a)) {
    b[k, i] <- a[k, i] + a[k, i - 1]             # write each pair's sum into the even column
  }
}
b <- as.data.frame(b)
b <- b[, -seq(1, length(b), by = 2)]             # drop the unused odd columns
Is there a way to make it simpler?
We could use split.default to split the columns into pairs and then apply rowSums over the resulting list:
sapply(split.default(a, as.integer(gl(ncol(a), 2, ncol(a)))), rowSums)
1 2
[1,] 0 2
[2,] 1 0
[3,] 1 2
[4,] 1 1
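For clarity, the gl() call just builds the column-pairing index used by split.default:
# gl(4, 2, 4) repeats each level twice and truncates to 4 values: columns 1-2 form group 1, columns 3-4 form group 2
as.integer(gl(ncol(a), 2, ncol(a)))
# [1] 1 1 2 2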
You can use vector recycling to select columns and add them.
res <- a[c(TRUE, FALSE)] + a[c(FALSE, TRUE)]
names(res) <- paste0('col', seq_along(res))
res
# col1 col2
#1 0 2
#2 1 0
#3 1 2
#4 1 1
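The same selection can also be written with explicit odd/even column indices, which some readers find easier to follow (a sketch, assuming an even number of columns):
odd  <- seq(1, ncol(a), by = 2)   # columns 1, 3, ...
even <- seq(2, ncol(a), by = 2)   # columns 2, 4, ...
res <- a[odd] + a[even]
names(res) <- paste0('col', seq_along(res))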
dplyr's approach with row-wise operations (rowwise() is a special type of grouping, with one group per row):
a <- data.frame(V1= c(0,1,0,0),V2=c(0,0,1,1),V3=c(1,0,1,0),V4=c(1,0,1,1))
library(dplyr)
a %>%
  rowwise() %>%
  transmute(I = sum(c(V1, V2)),
            II = sum(c(V3, V4)))
Or alternatively with rowSums(), a built-in row-wise variant of sum:
a %>% transmute(I = rowSums(across(1:2)),
                II = rowSums(across(3:4)))
I want to take the levels of a column in one DF and add each level as a new column in a new DF. Here is a toy dataset showing the source and ideal target DFs.
Source DF
person hour ride
Bill 1 A
Sue 2 B
Bob 1 C
Jill 3 B
Dan 3 A
Tina 3 A
Mapped DF
hour A B C Saturation
1 1 0 1 .66
2 0 1 0 .33
3 1 1 0 .66
Here is a test data set:
test_data <- cbind.data.frame(person = c('Bill', 'Sue', 'Bob', 'Jill', 'Dan', 'Tina'),
hour = factor(c(1, 2, 1, 3, 3, 3)),
ride = c('A', 'B', 'C', 'B', 'A', 'A'))
test_data$person <- as.character(test_data$person)
See how each ride in Source turns into a new column in Mapped. I can get levels and use them to create a mapped DF via
new_data <- cbind.data.frame(hour = levels(test_data$hour))
but it all fails when I try to iterate through levels to add new columns. I see the levels.
unlist(lapply(levels(test_data$ride), function(x) paste(x)))
yields
[1] "A" "B" "C"
So how do I go through the levels in $ride and add a column for each one in the mapped DF?
Bonus: I am going to run through each of the rows in test_data and use ifelse() to put a 1 in the column that corresponds to that ride (to show it had a rider) and a 0 otherwise, but someone must know how to do this more elegantly. As it stands, I would need an ifelse for every column extracted from the levels in $ride, which has to be more verbose than necessary.
require(reshape2)
mydat <- recast(test_data, hour ~ ride)
mydat
hour A B C
1 1 1 0 1
2 2 0 1 0
3 3 2 1 0
# 2nd part: cap each ride count at 1
for (i in 2:ncol(mydat)) {
  for (ii in 1:nrow(mydat)) {
    if (mydat[ii, i] > 0) mydat[ii, i] <- 1
  }
}
hour A B C
1 1 1 0 1
2 2 0 1 0
3 3 1 1 0
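As a side note, the nested loop above can be replaced by a single vectorized step (a sketch, assuming mydat is the recast data frame from above):
mydat[-1] <- lapply(mydat[-1], function(x) as.integer(x > 0))  # turn every positive count into 1 (all columns except hour)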
We can use dcast from data.table
library(data.table)
dcast(setDT(test_data), hour ~ ride, value.var = "person",
      fun.aggregate = function(x) as.integer(length(x) > 0))[,
  Saturation := round(rowSums(.SD)/3, 2), .SDcols = A:C][]
# hour A B C Saturation
#1: 1 1 0 1 0.67
#2: 2 0 1 0 0.33
#3: 3 1 1 0 0.67
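If the number of ride types can vary, the hard-coded divisor 3 could be replaced by the number of distinct rides (a sketch using the same test_data):
n_rides <- length(unique(test_data$ride))   # 3 in this example
res <- dcast(setDT(test_data), hour ~ ride, value.var = "person",
             fun.aggregate = function(x) as.integer(length(x) > 0))
res[, Saturation := round(rowSums(.SD) / n_rides, 2), .SDcols = setdiff(names(res), "hour")][]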
Say I have this data.table:
df <- data.frame(ID = c("A","A","A","A","B","B","B","B"),
Flag = c(1,1,1,1,0,0,0,1))
df <- data.table(df)
df
ID Flag
1: A 1
2: A 1
3: A 1
4: A 1
5: B 0
6: B 0
7: B 0
8: B 1
and I wish to count the number of 0 and 1 flags for each ID, as such:
Summary <- df[, list(Count = .N), by = c("ID","Flag")]
this returns the following results:
Summary
ID Flag Count
1: A 1 4
2: B 0 3
3: B 1 1
So, since there are no 0's recorded against ID A, there is no row which lists the combination of ID A and Flag 0, with a count of zero.
What would be the way to do this using data.table?
I.e., I want to achieve this result:
Summary
ID Flag Count
1 A 0 0
2 A 1 4
3 B 0 3
4 B 1 1
Thanks!
You could convert the column to a factor and then tabulate. Since we know we only want the levels 0 and 1, we can simply emit 0:1 for the Flag column rather than actually storing Flag as a factor, although this method is slower than the one in the second part below.
df[, .(Flag = 0:1, Count = tabulate(factor(Flag, levels = 0:1))), by = ID]
# ID Flag Count
# 1: A 0 0
# 2: A 1 4
# 3: B 0 3
# 4: B 1 1
As thelatemail notes in the comments, a faster method would be to factor the whole column first, then tabulate based on ID.
df[, Flag := factor(Flag, levels = 0:1)]
df[, .(Flag = levels(Flag), Count = tabulate(Flag)), by = ID]
# ID Flag Count
# 1: A 0 0
# 2: A 1 4
# 3: B 0 3
# 4: B 1 1
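Another common data.table idiom for zero-filled counts is to join against every ID/Flag combination built with CJ (a sketch, using the original df from the question):
# cross-join all observed IDs and Flags, then count matching rows per combination;
# with by = .EACHI, .N is 0 for combinations that never occur (here ID A with Flag 0)
df[CJ(ID = ID, Flag = Flag, unique = TRUE), on = .(ID, Flag), .(Count = .N), by = .EACHI]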
Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following data frame
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns into this (note: y is the value being counted):
EDIT: to explain the transformation: x is what I'm grouping by, and for each group I want to count how many times 0, 1 and 2 appear in the other columns. For example, in the first row of the transformed data frame we counted how often the value 0 (the y) occurs in the rows where x = 1: once in column a, twice in column b, and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary here because the default aggregation function is length. If you do not supply one explicitly, you will get a warning about it:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the data frame to a data table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This allows you to use count to get the information about how often each value occurs in columns a to c. After that, you reshape the dataset back into the required format using spread.
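Note that gather and spread have since been superseded in tidyr; assuming a recent tidyr (1.1 or later), an equivalent sketch with pivot_longer/pivot_wider would be:
res <- df %>%
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%   # long format
  count(x, variable, value) %>%                                      # count occurrences
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)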
I'd like to group this data but apply different functions to some columns when grouping.
ID type isDesc isImage
1 1 1 0
1 1 0 1
1 1 0 1
4 2 0 1
4 2 1 0
6 1 1 0
6 1 0 1
6 1 0 0
I want to group by ID; the columns isDesc and isImage can be summed, but I would like to keep the value of type as it is, since type is the same across every row of a given ID. The result should look like this:
ID type isDesc isImage
1 1 1 2
4 2 1 1
6 1 1 1
Currently I am using
library(plyr)
summarized = ddply(data, .(ID), numcolwise(sum))
but it simply sums up all the columns. You don't have to use ddply, but if you think it's good for the job I'd like to stick with it; the data.table library is also an option.
Using data.table:
require(data.table)
dt <- data.table(data, key="ID")
dt[, list(type=type[1], isDesc=sum(isDesc),
isImage=sum(isImage)), by=ID]
# ID type isDesc isImage
# 1: 1 1 1 2
# 2: 4 2 1 1
# 3: 6 1 1 1
Using plyr:
ddply(data, .(ID), summarise, type=type[1], isDesc=sum(isDesc), isImage=sum(isImage))
# ID type isDesc isImage
# 1 1 1 1 2
# 2 4 2 1 1
# 3 6 1 1 1
Edit: Using data.table's .SDcols, you can do this when you have many columns that should be summed and other columns from which you just want the first value.
dt1 <- dt[, lapply(.SD, sum), by=ID, .SDcols=c(3,4)]
dt2 <- dt[, lapply(.SD, head, 1), by=ID, .SDcols=c(2)]
dt2[dt1]
# ID type isDesc isImage
# 1: 1 1 1 2
# 2: 4 2 1 1
# 3: 6 1 1 1
You can provide column names or column numbers as arguments to .SDcols. Ex: .SDcols=c("type") is also valid.
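For completeness, a possible dplyr equivalent (a sketch, assuming dplyr 1.0 or later for across(), and using the question's data object):
library(dplyr)
data %>%
  group_by(ID) %>%
  summarise(type = first(type),                    # type is constant within each ID
            across(c(isDesc, isImage), sum))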