R - Output of aggregate and range gives 2 columns for every column name - how to restructure?

I am trying to produce a summary table showing the range of each variable by group. Here is some example data:
df <- data.frame(group=c("a","a","b","b","c","c"), var1=c(1:6), var2=c(7:12))
group var1 var2
1 a 1 7
2 a 2 8
3 b 3 9
4 b 4 10
5 c 5 11
6 c 6 12
I used the aggregate function like this:
df_range <- aggregate(df[,2:3], list(df$group), range)
Group.1 var1.1 var1.2 var2.1 var2.2
1 a 1 2 7 8
2 b 3 4 9 10
3 c 5 6 11 12
The output looked normal, but the dimensions are 3x3 instead of 3x5 and there are only 3 names:
names(df_range)
[1] "Group.1" "var1" "var2"
How do I get this back to the normal data frame structure with one name per column? Or alternatively, how do I get the same summary table without using aggregate and range?

That is the documented behaviour: aggregate has stored a matrix as a single column of the data frame. You can undo the effect with:
newdf <- do.call(data.frame, df_range)
# Group.1 var1.1 var1.2 var2.1 var2.2
#1 a 1 2 7 8
#2 b 3 4 9 10
#3 c 5 6 11 12
dim(newdf)
#[1] 3 5
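Checking the names confirms there is now one name per column:
names(newdf)
#[1] "Group.1" "var1.1"  "var1.2"  "var2.1"  "var2.2"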

Here's an approach using dplyr (note that this returns the width of the range, i.e. max minus min, for each group rather than the two endpoints):
library(dplyr)
df %>%
  group_by(group) %>%
  summarise_each(funs(max(.) - min(.)), var1, var2)
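summarise_each() and funs() are superseded in current dplyr. If you want both ends of the range, as in the aggregate() output, a sketch using across() (available from dplyr 1.0) could be:
library(dplyr)
df %>%
  group_by(group) %>%
  summarise(across(c(var1, var2), list(min = min, max = max)))
This gives one row per group with columns var1_min, var1_max, var2_min and var2_max.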

Related

How to merge columns in R with different levels of values

I have been given a dataset that I am attempting to perform logistic regression on. However, to do so, I need to merge some columns in R.
For instance, in the carevaluations data set, I am given (BuyingPrice_low, BuyingPrice_medium, BuyingPrice_high, BuyingPrice_vhigh, MaintenancePrice_low, MaintenancePrice_medium, MaintenancePrice_high, MaintenancePrice_vhigh).
How would I combine the columns BuyingPrice_low, BuyingPrice_medium, etc. into one column called "BuyingPrice", keeping the order and their respective data, and do the same for the MaintenancePrice columns?
library(dplyr)
library(tidyr)   # gather() comes from tidyr, not dplyr
df <- data.frame(Buy_low = rep(c(0, 1), 10),
                 Buy_high = rep(c(0, 1), 10))
one_column <- df %>%
  gather(var, value)
head(one_column)
var value
1 Buy_low 0
2 Buy_low 1
3 Buy_low 0
4 Buy_low 1
5 Buy_low 0
6 Buy_low 1
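gather() is superseded in current tidyr; the same reshaping with pivot_longer() (tidyr 1.0 or later), shown here on two of the question's BuyingPrice_* columns with made-up 0/1 values, might look like:
library(tidyr)
df <- data.frame(BuyingPrice_low  = c(1, 0, 0),
                 BuyingPrice_high = c(0, 1, 1))
pivot_longer(df, starts_with("BuyingPrice_"),
             names_to     = "BuyingPrice",
             names_prefix = "BuyingPrice_",
             values_to    = "value")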
It can be done with stack in base R:
df1 <- data.frame(a=1:3,b=4:6,c=7:9)
stack(df1)
# values ind
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c

R Sum columns by index

I need to find a way to sum columns by their index. I'm working on a big file read in with read.csv, so I'll show a sample of the problem here; for example, I'd like to sum the 2nd to the 5th columns and the 6th to the 7th columns of the following matrix:
a 1 3 3 4 5 6
b 2 1 4 3 4 1
c 1 3 2 1 1 5
d 2 2 4 3 1 3
The result has to be like this:
a 11 11
b 10 5
c 7 6
d 11 4
The columns all have different names.
We can use rowSums on the subset of columns i.e 2:5 and 6:7 separately and then create a new data.frame with the output.
data.frame(df1[1], Sum1=rowSums(df1[2:5]), Sum2=rowSums(df1[6:7]))
# id Sum1 Sum2
#1 a 11 11
#2 b 10 5
#3 c 7 6
#4 d 11 4
The package dplyr has a function made exactly for that purpose:
require(dplyr)
df1 <- data.frame(a = c(1,2,3,4,3,3), b = c(1,2,3,2,1,2), c = c(1,2,3,21,2,3))
# referring to the columns by name
df2 <- df1 %>% transmute(sum1 = a + b, sum2 = b + c)
# or by position, which works whatever the columns are called
df2 <- df1 %>% transmute(sum1 = .[[1]] + .[[2]], sum2 = .[[2]] + .[[3]])
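If you want to stay inside a dplyr pipeline but still address the columns by position (the question notes that the names all differ), one possible sketch, assuming dplyr 1.0 or later for across(), is:
library(dplyr)
# data shaped like the question's: an id column followed by six numeric columns
dat <- read.table(text = "a 1 3 3 4 5 6
b 2 1 4 3 4 1
c 1 3 2 1 1 5
d 2 2 4 3 1 3")
dat %>%
  mutate(sum1 = rowSums(across(2:5)),   # 2nd to 5th columns
         sum2 = rowSums(across(6:7)))   # 6th and 7th columns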

Subsetting a data frame to the rows not appearing in another data frame

I have a data frame A with observations
Var1 Var2 Var3
1 3 4
2 5 6
4 5 7
4 5 8
6 7 9
and data frame B with observations
Var1 Var2 Var3
1 3 4
2 5 6
which is basically a subset of A.
Now I want to select the observations in A that are NOT in B, i.e., the data frame C with observations
Var1 Var2 Var3
4 5 7
4 5 8
6 7 9
Is there a way I can do this in R? The data frames I've used are just arbitrary data.
dplyr has a nice anti_join function that does exactly that:
> library(dplyr)
> anti_join(A, B)
Joining by: c("Var1", "Var2", "Var3")
Var1 Var2 Var3
1 6 7 9
2 4 5 8
3 4 5 7
Using sqldf is an option.
require(sqldf)
C <- sqldf('SELECT * FROM A EXCEPT SELECT * FROM B')
One approach could be to paste all the columns of A and B together, limiting to the rows in A whose pasted representation doesn't appear in the pasted representation of B:
A[!(do.call(paste, A) %in% do.call(paste, B)),]
# Var1 Var2 Var3
# 3 4 5 7
# 4 4 5 8
# 5 6 7 9
One obvious downside of this approach is that it assumes two rows with the same pasted representation are in fact identical. Here is a slightly more clunky approach that doesn't have this limitation:
combined <- rbind(B, A)
combined[!duplicated(combined) & seq_len(nrow(combined)) > nrow(B),]
# Var1 Var2 Var3
# 5 4 5 7
# 6 4 5 8
# 7 6 7 9
Basically I used rbind to append A below B and then limited to rows that are both non-duplicated and that are not originally from B.
Another option:
C <- rbind(A, B)
C[!(duplicated(C) | duplicated(C, fromLast = TRUE)), ]
Output:
Var1 Var2 Var3
3 4 5 7
4 4 5 8
5 6 7 9
Using data.table you could do an anti-join as follows:
library(data.table)
setDT(A)[!B, on = names(A)]   # setDT() converts A to a data.table by reference
which gives the desired result:
Var1 Var2 Var3
1: 4 5 7
2: 4 5 8
3: 6 7 9
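dplyr's set operations give yet another one-liner; note that setdiff() treats the rows as a set, so duplicate rows within A would also be collapsed to one:
library(dplyr)
C <- setdiff(A, B)   # rows of A that do not appear in B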

How to do something to each element in the group

Suppose I have a dataframe like so
a b c
1 2 3
1 3 4
1 4 5
2 5 6
2 6 7
3 7 8
4 8 9
What I want is the following:
a b c d
1 2 3 a
1 3 4 b
1 4 5 c
2 5 6 a
2 6 7 b
3 7 8 a
4 8 9 a
Essentially, I want to do a cycling: for each group defined by column a, I want to create a new column that cycles through the letters from 'a' to 'z' in order. Group 1 has three elements, so the letters go from 'a' to 'c'. Groups 3 and 4 have only one element each, so they only get assigned 'a'.
A data.table option is
library(data.table)
setDT(dd)[, d:= letters[seq_len(.N)], by = a]
One way to do this is with a split-apply-combine paradigm, as in plyr (or dplyr or data.table, etc.).
Create data:
dd <- data.frame(a = rep(1:4, c(3, 2, 1, 1)),
                 b = 2:8, c = 3:9)
Use ddply to split the data frame by variable a, transforming each piece by adding an appropriate variable, then recombine:
library("plyr")
ddply(dd,"a",
transform,
d=letters[1:length(b)])
Or in dplyr:
library("dplyr")
dd %>% group_by(a) %>%
  mutate(d = letters[1:n()])
Or in base R (thanks #thelatemail):
dd$d <- ave(rownames(dd), dd$a,
            FUN = function(x) letters[seq_along(x)])
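All of the above index into letters, which has only 26 elements, so a group with more than 26 rows would produce NA. If the intent really is to cycle, i.e. wrap back to 'a' after 'z' (an assumption; the question never shows a group that large), a dplyr sketch with modular indexing could be:
library(dplyr)
dd %>%
  group_by(a) %>%
  mutate(d = letters[(seq_len(n()) - 1) %% 26 + 1]) %>%
  ungroup()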

Efficient way to remove dups in data frame but determining the row that stays randomly

I'm looking for the most compact and efficient way to find duplicates in a data frame based on a single variable (user_ID), randomly keeping one and deleting the others. I've been using something like this:
dupIDs <- user_db$user_ID[duplicated(user_db$user_ID)]
The important part is that I want the user_ID variable to be unique, so whenever there are dups, one should be randomly selected (cannot pick first or last, has to be random). I am looking for a loop-less solution - Thanks!
user_ID, var1, var2
1 3 4
1 5 6
2 7 7
3 8 8
Randomly yielding either:
user_ID, var1, var2
1 5 6
2 7 7
3 8 8
or
user_ID, var1, var2
1 3 4
2 7 7
3 8 8
Thanks in advance!!
Here's one option:
library(data.table)
setDT(df) # convert to data.table in place
set.seed(1)
# select 1 row randomly for each user_ID
df[df[, .I[sample(.N, 1)], by = user_ID]$V1]
# user_ID var1 var2
#1: 1 3 4
#2: 2 7 7
#3: 3 8 8
set.seed(4)
df[df[, .I[sample(.N, 1)], by = user_ID]$V1]
# user_ID var1 var2
#1: 1 5 6
#2: 2 7 7
#3: 3 8 8
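If you would rather do this with dplyr, a similar one-random-row-per-user_ID pick (assuming dplyr 1.0 or later for slice_sample()) might be:
library(dplyr)
set.seed(1)
df %>%
  group_by(user_ID) %>%
  slice_sample(n = 1) %>%
  ungroup()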
Using base functions:
DF <-
read.csv(text=
'user_ID,var1,var2
1,3,4
2,7,7
3,8,8
3,6,7
2,5,5
3,5,6
1,5,6')
# sort the data by user_ID
DF <- DF[order(DF$user_ID),]
# create random sub-indexes for each user_ID
subIdx <- unlist(sapply(rle(DF$user_ID)$lengths,
                        FUN = function(l) sample(1:l, l)))
# order again by user_ID then by sub-index
DF <- DF[order(DF$user_ID,subIdx),]
# remove the duplicate
DF <- DF[!duplicated(DF$user_ID),]
> DF
user_ID var1 var2
7 1 5 6
2 2 7 7
4 3 6 7
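A shorter base R variant of the same idea (shuffle the rows first, then keep the first occurrence of each user_ID) could be:
set.seed(1)
shuffled <- DF[sample(nrow(DF)), ]                      # put the rows in random order
DF_unique <- shuffled[!duplicated(shuffled$user_ID), ]  # keep the first (hence random) row per user_ID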
