How to choose between two replicated quantities in R

This is a simplified example.
I have a data frame with two variables like this:
a <- c(1,1,1,2,2,2,3,3,6,7,4,5,5,8)
b <- c(5,10,4,2,8,4,6,9,12,3,7,4,1,7)
D <- data.frame(a,b)
As you can see, there are 8 distinct values of a, but they are replicated, so the data frame has 14 observations. I want to create a data frame with 8 observations in which the a values are unique and each b value is the minimum over the duplicates, i.e., the result should look like:
a b
1 1 4
2 2 2
3 3 6
4 6 12
5 7 3
6 4 7
7 5 1
8 8 7

Here's how to do it with base R:
#both lines do the same thing, pick one
aggregate(D$b, by = D["a"], FUN = min)
aggregate(b ~ a, data = D, FUN = min)
Here's how to do it with data.table:
library(data.table)
setDT(D)
D[, .(b = min(b)), by = a] # name the result column; otherwise it is called V1
Here's how to do it with tidyverse functions:
library(tidyverse) #or just library(dplyr)
D %>%
  group_by(a) %>%
  summarize(b = min(b))

Using a base R approach:
> D2 <- D[order(D$a, D$b ), ]
> D2 <- D2[ !duplicated(D2$a), ]
> D2
a b
3 1 4
4 2 2
7 3 6
11 4 7
13 5 1
9 6 12
10 7 3
14 8 7
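The row names (3, 4, 7, ...) are just the positions of the kept rows in the original data frame; if you want them renumbered, one extra line does it:
rownames(D2) <- NULL # reset row names to 1:8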

A base R option would be
aggregate(b ~ a, D, min)

library(dplyr)
D <- D %>%
  group_by(a) %>%
  summarize(b = min(b))


Merge and sum two different data.tables [duplicate]

I have two different data.tables. I need to merge them and sum based on the row values. Example input tables and the expected output are shown below.
Input
Table 1
X A  B
A 3  NA
B 4  6
C 5  NA
D 9  12
Table 2
X A  B
A 1  5
B 6  8
C 7  14
D 5  NA
E 1  1
F 2  3
G 5  6
Expected Output:
X A B
A 4 5
B 10 14
C 12 14
D 14 12
E 1 1
F 2 3
G 5 6
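The answers below refer to the two tables as df1 and df2; one way to construct them (with the blanks taken as NA, as the first answer notes):
library(data.table)
df1 <- data.table(X = c("A", "B", "C", "D"),
                  A = c(3, 4, 5, 9),
                  B = c(NA, 6, NA, 12))
df2 <- data.table(X = c("A", "B", "C", "D", "E", "F", "G"),
                  A = c(1, 6, 7, 5, 1, 2, 5),
                  B = c(5, 8, 14, NA, 1, 3, 6))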
We can do this by rbinding the two tables and then doing a grouped sum:
library(data.table)
rbindlist(list(df1, df2))[, lapply(.SD, sum, na.rm = TRUE), by = X]
# X A B
#1: A 4 5
#2: B 10 14
#3: C 12 14
#4: D 14 12
#5: E 1 1
#6: F 2 3
#7: G 5 6
Or using a similar approach with dplyr
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(X) %>%
  summarise_all(funs(sum(., na.rm = TRUE)))
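Note that funs() was deprecated in dplyr 0.8; in current dplyr (1.0.0 and later) the same idea is written with across(), roughly:
bind_rows(df1, df2) %>%
  group_by(X) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))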
Note: Here, the blanks in the original tables are treated as NA, and the 'A' and 'B' columns are assumed to be of numeric/integer class.
Merge your tables together first, then do the sum. Since both tables share the column names A and B, merge suffixes them as A.x/A.y and B.x/B.y; sum those pairwise, then you can easily drop the individual values:
out <- merge(df1, df2, by = "X", all = TRUE)
out$A <- rowSums(out[, c("A.x", "A.y")], na.rm = TRUE)
out$B <- rowSums(out[, c("B.x", "B.y")], na.rm = TRUE)
out$A.x <- out$A.y <- out$B.x <- out$B.y <- NULL # drop original values
The code below does the required job for all numeric columns at once (full_join with no by argument joins on all shared columns):
library(dplyr)
Table <- Table1 %>%
  full_join(Table2) %>%
  group_by(X) %>%
  summarise_all(funs(sum(., na.rm = TRUE)))

Picking up only specific columns based on conditions on multiple columns in R [duplicate]

I have a data frame, say
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
it looks like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 1 e
6 3 2 f
7 3 3 g
8 6 1 h
9 8 1 i
10 8 2 j
11 8 3 k
12 8 4 l
I would like to pick unique elements from column x, keeping for each one the row where y is maximum (for example, rows 5 to 7 all have x = 3, and I would like to keep the row where y = 3, the maximum; similarly, for x = 8 I would like to keep the row where y = 4).
the output should look like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 3 g
6 6 1 h
7 8 4 l
I have a solution, which I am posting as an answer below, but is there a better way to achieve this? My solution only works in this specific case (picking the largest); what is the general solution?
One solution using dplyr (note that slice(max(y)) picks the max(y)-th row of each group, which works here only because rows are sorted by y within each x; filter(y == max(y)) or, in current dplyr, slice_max(y) is more robust):
library(dplyr)
df %>%
  group_by(x) %>%
  slice(max(y))
# x y z
# (dbl) (dbl) (chr)
#1 1 1 a
#2 2 1 b
#3 3 3 g
#4 5 1 c
#5 6 1 d
#6 8 4 l
The base R alternative is aggregate, though note that it returns only the x and y columns, dropping z:
aggregate(y ~ x, df, max)
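To recover z as well, one sketch is to merge the aggregated result back onto the original data; merge joins on the common columns x and y, so only the max-y row of each group survives:
agg <- aggregate(y ~ x, df, max)
merge(agg, df) # keeps only the rows matching the per-group maximum y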
You can achieve the same result using a dplyr chain and dplyr's group_by function. Once you use group_by, the rest of the functions in the chain are applied within each group, as opposed to the whole data.frame. So here I filter to the rows where y equals max(y) within each grouping value of x. This can be extended to the min of y or a particular value.
I think it's generally good practice to ungroup the data at the end of a chain that uses group_by, to avoid any unexpected behavior.
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>%
  group_by(x) %>%
  filter(y == max(y)) %>%
  ungroup()
To make it more general... say instead you wanted the mean of y for a given x as opposed to the max. You could then use the summarise function instead of the filter as shown below.
df %>%
  group_by(x) %>%
  summarise(y = mean(y)) %>%
  ungroup()
Using data.table, we can use df[order(z), .I[which.max(y)], by = x] to get the row numbers of interest, e.g.:
library(data.table)
setDT(df)
df[df[order(z), .I[which.max(y)], by = x][, V1]]
x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l
Here is my solution using the dplyr package:
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
                 y = c(1,1,1,1,1,2,3,1,1,2,3,4),
                 z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df, desc(y))
df_out <- df[!duplicated(df$x), ]
df_out
Printing df_out
x y z
1 8 4 l
2 3 3 g
6 1 1 a
7 2 1 b
8 5 1 c
9 6 1 d
Assuming the data frame is ordered by df[order(df$x, df$y), ], as it is in the example, you can use the base R functions split, lapply, and do.call/rbind to extract the desired rows using the "split/apply/combine" methodology.
do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l
split breaks up the data.frame into a list based on x. This list is fed to lapply, which selects the last row of each data.frame and returns these one-row data.frames as a list. This list is then bound into a single data frame with rbind via do.call.

How to count how many values per level in a given factor?

I have a data.frame mydf with about 2500 rows. These rows correspond to 69 classes of objects in column 1 (mydf$V1), and I want to count how many rows per object class I have.
I can get a factor of these classes with:
objectclasses = unique(factor(mydf$V1, exclude="1"));
What's the terse R way to count the rows per object class? If this were any other language I'd be traversing an array with a loop and keeping count but I'm new to R programming and am trying to take advantage of R's vectorised operations.
Or using the dplyr library:
library(dplyr)
set.seed(1)
dat <- data.frame(ID = sample(letters,100,rep=TRUE))
dat %>%
  group_by(ID) %>%
  summarise(no_rows = length(ID))
Note the use of %>%, which is similar to the use of pipes in bash. Effectively, the code above pipes dat into group_by, and the result of that operation is piped into summarise.
The result is:
Source: local data frame [26 x 2]
ID no_rows
1 a 2
2 b 3
3 c 3
4 d 3
5 e 2
6 f 4
7 g 6
8 h 1
9 i 6
10 j 5
11 k 6
12 l 4
13 m 7
14 n 2
15 o 2
16 p 2
17 q 5
18 r 4
19 s 5
20 t 3
21 u 8
22 v 4
23 w 5
24 x 4
25 y 3
26 z 1
See the dplyr introduction for some more context, and the documentation for details regarding the individual functions.
Here are 2 ways to do it:
set.seed(1)
tt <- sample(letters,100,rep=TRUE)
## using table
table(tt)
tt
a b c d e f g h i j k l m n o p q r s t u v w x y z
2 3 3 3 2 4 6 1 6 5 6 4 7 2 2 2 5 4 5 3 8 4 5 4 3 1
## using tapply
tapply(tt,tt,length)
a b c d e f g h i j k l m n o p q r s t u v w x y z
2 3 3 3 2 4 6 1 6 5 6 4 7 2 2 2 5 4 5 3 8 4 5 4 3 1
Using plyr package:
library(plyr)
count(mydf$V1)
It returns the frequency of each value.
Using data.table
library(data.table)
setDT(dat)[, .N, keyby = ID] # (using @Paul Hiemstra's dat from above)
Or using dplyr 0.3
res <- count(dat, ID)
head(res)
#Source: local data frame [6 x 2]
# ID n
#1 a 2
#2 b 3
#3 c 3
#4 d 3
#5 e 2
#6 f 4
Or
dat %>%
  group_by(ID) %>%
  tally()
Or
dat %>%
  group_by(ID) %>%
  summarise(n = n())
We can use summary on a factor column:
summary(myDF$factorColumn)
One more approach is to apply the n() function, which counts the number of observations:
library(dplyr)
library(magrittr)
data %>%
  group_by(columnName) %>%
  summarise(Count = n())
In case I just want to know how many unique factor levels exist in the data, I use:
length(unique(df$factorcolumn))
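If the column is already a factor, base R's nlevels() reports the number of declared levels directly; note that, unlike length(unique(...)), it counts levels even when they have zero observations:
f <- factor(c("a", "b"), levels = c("a", "b", "c"))
nlevels(f)        # 3: counts declared levels
length(unique(f)) # 2: counts observed values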
Use the package plyr with lapply to get frequencies for every value (level) and every variable (factor) in your data frame.
library(plyr)
lapply(df, count)
This is an old post, but you can do this with base R and no data frames/data tables:
sapply(levels(yTrain), function(sLevel) sum(yTrain == sLevel))

Adding row/column total data when aggregating data using the plyr and reshape2 packages in R

I create aggregate tables most of the time during my work using the flow below:
library(plyr)
library(reshape2)
set.seed(1)
temp.df <- data.frame(var1 = sample(letters[1:5], 100, replace = TRUE),
                      var2 = sample(11:15, 100, replace = TRUE))
temp.output <- ddply(temp.df,
                     c("var1", "var2"),
                     function(df) {
                       data.frame(count = nrow(df))
                     })
temp.output.all <- ddply(temp.df,
                         c("var2"),
                         function(df) {
                           data.frame(var1 = "all",
                                      count = nrow(df))
                         })
temp.output <- rbind(temp.output, temp.output.all)
temp.output[, "var1"] <- factor(temp.output[, "var1"], levels = c(letters[1:5], "all"))
temp.output <- dcast(temp.output, formula = var2 ~ var1, value.var = "count", fill = 0)
I'm starting to feel silly writing this "boilerplate" code every time I create a new aggregate table just to include the row/column totals. Is there some way to skip it?
Looking at your desired output (now that I'm in front of a computer), perhaps you should look at the margins argument of dcast:
library(reshape2)
dcast(temp.df, var2 ~ var1, value.var = "var2",
      fun.aggregate = length, margins = "var1")
# var2 a b c d e (all)
# 1 11 3 1 6 4 2 16
# 2 12 1 3 6 5 5 20
# 3 13 5 9 3 6 1 24
# 4 14 4 7 3 6 2 22
# 5 15 0 5 1 5 7 18
Also look into the addmargins function in base R.
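For reference, a minimal sketch of that base R route with table() and addmargins(), using the question's temp.df:
tab <- table(temp.df$var2, temp.df$var1)
addmargins(tab, margin = 2) # appends a Sum column across var1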

subset with pattern

Say I have a data frame df
df <- data.frame( a1 = 1:10, b1 = 2:11, c2 = 3:12 )
I wish to subset the columns, but with a pattern, something like (pseudo-code):
df1 <- subset( df, select= (pattern = "1") )
To get
> df1
a1 b1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
Is this possible?
It is possible to do this via
subset(df, select = grepl("1", names(df)))
For automating this as a function, one can use [ to do the subsetting. Couple that with one of R's regular expression functions and you have all you need.
By way of an example, here is a custom function implementing the ideas I mentioned above.
Subset <- function(df, pattern) {
  ind <- grepl(pattern, names(df))
  df[, ind]
}
Note this does no error checking and just relies upon grepl to return a logical vector indicating which columns match pattern, which is then passed to [ to subset by columns. Applied to your df this gives:
> Subset(df, pattern = "1")
a1 b1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
Same same but different:
df2 <- df[, grep("1", names(df))]
a1 b1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
Base R now has a convenience function endsWith():
df[, endsWith(names(df), "1")]
In data.table you can do:
library(data.table)
setDT(df)
df[, .SD, .SDcols = patterns("1")]
# Or more precisely
df[, .SD, .SDcols = patterns("1$")]
In dplyr:
library(dplyr)
select(df, ends_with("1"))
