R - Output of aggregate and range gives 2 columns for every column name - how to restructure?

I am trying to produce a summary table showing the range of each variable by group. Here is some example data:
df <- data.frame(group=c("a","a","b","b","c","c"), var1=c(1:6), var2=c(7:12))
group var1 var2
1 a 1 7
2 a 2 8
3 b 3 9
4 b 4 10
5 c 5 11
6 c 6 12
I used the aggregate function like this:
df_range <- aggregate(df[,2:3], list(df$group), range)
Group.1 var1.1 var1.2 var2.1 var2.2
1 a 1 2 7 8
2 b 3 4 9 10
3 c 5 6 11 12
The output looked normal, but the dimensions are 3x3 instead of 3x5 and there are only 3 names:
names(df_range)
[1] "Group.1" "var1" "var2"
How do I get this back to the normal data frame structure with one name per column? Or alternatively, how do I get the same summary table without using aggregate and range?

That is the documented behaviour: aggregate has stored a matrix as a single column of the data frame. You can undo the effect with:
newdf <- do.call(data.frame, df_range)
# Group.1 var1.1 var1.2 var2.1 var2.2
#1 a 1 2 7 8
#2 b 3 4 9 10
#3 c 5 6 11 12
dim(newdf)
#[1] 3 5
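Checking the names confirms there is now one name per column:
names(newdf)
#[1] "Group.1" "var1.1"  "var1.2"  "var2.1"  "var2.2"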

Here's an approach using dplyr (note that this returns the width of the range, i.e. max minus min, for each group rather than the two endpoints):
library(dplyr)
df %>%
  group_by(group) %>%
  summarise_each(funs(max(.) - min(.)), var1, var2)
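summarise_each() and funs() are superseded in current dplyr. If you want both ends of the range, as in the aggregate() output, a sketch using across() (available from dplyr 1.0) could be:
library(dplyr)
df %>%
  group_by(group) %>%
  summarise(across(c(var1, var2), list(min = min, max = max)))
This gives one row per group with columns var1_min, var1_max, var2_min and var2_max.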

Related

How to merge columns in R with different levels of values

I have been given a dataset that I am attempting to perform logistic regression on. However, to do so, I need to merge some columns in R.
For instance, in the carevaluations data set, I am given (BuyingPrice_low, BuyingPrice_medium, BuyingPrice_high, BuyingPrice_vhigh, MaintenancePrice_low, MaintenancePrice_medium, MaintenancePrice_high, MaintenancePrice_vhigh).
How would I combine the columns BuyingPrice_low, BuyingPrice_medium, etc. into one column called "BuyingPrice", keeping the order and their respective data, and do the same for the MaintenancePrice columns?
library(dplyr)
library(tidyr)   # gather() comes from tidyr, not dplyr
df <- data.frame(Buy_low = rep(c(0, 1), 10),
                 Buy_high = rep(c(0, 1), 10))
one_column <- df %>%
  gather(var, value)
head(one_column)
var value
1 Buy_low 0
2 Buy_low 1
3 Buy_low 0
4 Buy_low 1
5 Buy_low 0
6 Buy_low 1
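gather() is superseded in current tidyr; the same reshaping with pivot_longer() (tidyr 1.0 or later), shown here on two of the question's BuyingPrice_* columns with made-up 0/1 values, might look like:
library(tidyr)
df <- data.frame(BuyingPrice_low  = c(1, 0, 0),
                 BuyingPrice_high = c(0, 1, 1))
pivot_longer(df, starts_with("BuyingPrice_"),
             names_to     = "BuyingPrice",
             names_prefix = "BuyingPrice_",
             values_to    = "value")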
It can be done with stack in base R:
df1 <- data.frame(a=1:3,b=4:6,c=7:9)
stack(df1)
# values ind
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c

R Sum columns by index

I need to find a way to sum columns by their index. I'm working on a big file read in with read.csv, so I'll show a sample of the problem here; for example, I'd like to sum the 2nd to the 5th columns and the 6th to the 7th columns of the following matrix:
a 1 3 3 4 5 6
b 2 1 4 3 4 1
c 1 3 2 1 1 5
d 2 2 4 3 1 3
The result has to be like this:
a 11 11
b 10 5
c 7 6
d 11 4
The columns all have different names.
We can use rowSums on the subset of columns i.e 2:5 and 6:7 separately and then create a new data.frame with the output.
data.frame(df1[1], Sum1=rowSums(df1[2:5]), Sum2=rowSums(df1[6:7]))
# id Sum1 Sum2
#1 a 11 11
#2 b 10 5
#3 c 7 6
#4 d 11 4
The package dplyr has a function made exactly for that purpose:
require(dplyr)
df1 <- data.frame(a = c(1,2,3,4,3,3), b = c(1,2,3,2,1,2), c = c(1,2,3,21,2,3))
# referring to the columns by name
df2 <- df1 %>% transmute(sum1 = a + b, sum2 = b + c)
# or by position, which works whatever the columns are called
df2 <- df1 %>% transmute(sum1 = .[[1]] + .[[2]], sum2 = .[[2]] + .[[3]])
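If you want to stay inside a dplyr pipeline but still address the columns by position (the question notes that the names all differ), one possible sketch, assuming dplyr 1.0 or later for across(), is:
library(dplyr)
# data shaped like the question's: an id column followed by six numeric columns
dat <- read.table(text = "a 1 3 3 4 5 6
b 2 1 4 3 4 1
c 1 3 2 1 1 5
d 2 2 4 3 1 3")
dat %>%
  mutate(sum1 = rowSums(across(2:5)),   # 2nd to 5th columns
         sum2 = rowSums(across(6:7)))   # 6th and 7th columns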

Subsetting a data frame to the rows not appearing in another data frame

I have a data frame A with observations
Var1 Var2 Var3
1 3 4
2 5 6
4 5 7
4 5 8
6 7 9
and data frame B with observations
Var1 Var2 Var3
1 3 4
2 5 6
which is basically a subset of A.
Now I want to select the observations in A that are NOT in B, i.e., the data frame C with observations
Var1 Var2 Var3
4 5 7
4 5 8
6 7 9
Is there a way I can do this in R? The data frames I've used are just arbitrary data.
dplyr has a nice anti_join function that does exactly that:
> library(dplyr)
> anti_join(A, B)
Joining by: c("Var1", "Var2", "Var3")
Var1 Var2 Var3
1 6 7 9
2 4 5 8
3 4 5 7
Using sqldf is an option.
require(sqldf)
C <- sqldf('SELECT * FROM A EXCEPT SELECT * FROM B')
One approach could be to paste all the columns of A and B together, limiting to the rows in A whose pasted representation doesn't appear in the pasted representation of B:
A[!(do.call(paste, A) %in% do.call(paste, B)),]
# Var1 Var2 Var3
# 3 4 5 7
# 4 4 5 8
# 5 6 7 9
One obvious downside of this approach is that it assumes two rows with the same pasted representation are in fact identical. Here is a slightly more clunky approach that doesn't have this limitation:
combined <- rbind(B, A)
combined[!duplicated(combined) & seq_len(nrow(combined)) > nrow(B),]
# Var1 Var2 Var3
# 5 4 5 7
# 6 4 5 8
# 7 6 7 9
Basically I used rbind to append A below B and then limited to rows that are both non-duplicated and that are not originally from B.
Another option:
C <- rbind(A, B)
C[!(duplicated(C) | duplicated(C, fromLast = TRUE)), ]
Output:
Var1 Var2 Var3
3 4 5 7
4 4 5 8
5 6 7 9
Using data.table you could do an anti-join as follows:
library(data.table)
setDT(A)[!B, on = names(A)]   # setDT() converts A to a data.table by reference
which gives the desired result:
Var1 Var2 Var3
1: 4 5 7
2: 4 5 8
3: 6 7 9
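dplyr's set operations give yet another one-liner; note that setdiff() treats the rows as a set, so duplicate rows within A would also be collapsed to one:
library(dplyr)
C <- setdiff(A, B)   # rows of A that do not appear in B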

How to do something to each element in the group

Suppose I have a dataframe like so
a b c
1 2 3
1 3 4
1 4 5
2 5 6
2 6 7
3 7 8
4 8 9
What I want is the following:
a b c d
1 2 3 a
1 3 4 b
1 4 5 c
2 5 6 a
2 6 7 b
3 7 8 a
4 8 9 a
Essentially, I want to do a cycling: for each group defined by column a, I want to create a new column that cycles through the letters from 'a' to 'z' in order. Group 1 has three elements, so the letters go from 'a' to 'c'. Groups 3 and 4 have only one element each, so they only get assigned 'a'.
A data.table option is
library(data.table)
setDT(dd)[, d:= letters[seq_len(.N)], by = a]
One way to do this is with a split-apply-combine paradigm, as in plyr (or dplyr or data.table, etc.).
Create data:
dd <- data.frame(a = rep(1:4, c(3, 2, 1, 1)),
                 b = 2:8, c = 3:9)
Use ddply to split the data frame by variable a, transforming each piece by adding an appropriate variable, then recombine:
library("plyr")
ddply(dd,"a",
transform,
d=letters[1:length(b)])
Or in dplyr:
library("dplyr")
dd %>% group_by(a) %>%
  mutate(d = letters[1:n()])
Or in base R (thanks #thelatemail):
dd$d <- ave(rownames(dd), dd$a,
            FUN = function(x) letters[seq_along(x)])
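All of the above index into letters, which has only 26 elements, so a group with more than 26 rows would produce NA. If the intent really is to cycle, i.e. wrap back to 'a' after 'z' (an assumption; the question never shows a group that large), a dplyr sketch with modular indexing could be:
library(dplyr)
dd %>%
  group_by(a) %>%
  mutate(d = letters[(seq_len(n()) - 1) %% 26 + 1]) %>%
  ungroup()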

Efficient way to remove dups in data frame but determining the row that stays randomly

I'm looking for the most compact and efficient way to find duplicates in a data frame based on a single variable (user_ID), randomly keeping one and deleting the others. I've been using something like this:
dupIDs <- user_db$user_ID[duplicated(user_db$user_ID)]
The important part is that I want the user_ID variable to be unique, so whenever there are dups, one should be randomly selected (cannot pick first or last, has to be random). I am looking for a loop-less solution - Thanks!
user_ID, var1, var2
1 3 4
1 5 6
2 7 7
3 8 8
Randomly yielding either:
user_ID, var1, var2
1 5 6
2 7 7
3 8 8
or
user_ID, var1, var2
1 3 4
2 7 7
3 8 8
Thanks in advance!!
Here's one option:
library(data.table)
setDT(df) # convert to data.table in place
set.seed(1)
# select 1 row randomly for each user_ID
df[df[, .I[sample(.N, 1)], by = user_ID]$V1]
# user_ID var1 var2
#1: 1 3 4
#2: 2 7 7
#3: 3 8 8
set.seed(4)
df[df[, .I[sample(.N, 1)], by = user_ID]$V1]
# user_ID var1 var2
#1: 1 5 6
#2: 2 7 7
#3: 3 8 8
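If you would rather do this with dplyr, a similar one-random-row-per-user_ID pick (assuming dplyr 1.0 or later for slice_sample()) might be:
library(dplyr)
set.seed(1)
df %>%
  group_by(user_ID) %>%
  slice_sample(n = 1) %>%
  ungroup()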
Using base functions:
DF <-
read.csv(text=
'user_ID,var1,var2
1,3,4
2,7,7
3,8,8
3,6,7
2,5,5
3,5,6
1,5,6')
# sort the data by user_ID
DF <- DF[order(DF$user_ID),]
# create random sub-indexes for each user_ID
subIdx <- unlist(sapply(rle(DF$user_ID)$lengths,
                        FUN = function(l) sample(1:l, l)))
# order again by user_ID then by sub-index
DF <- DF[order(DF$user_ID,subIdx),]
# remove the duplicate
DF <- DF[!duplicated(DF$user_ID),]
> DF
user_ID var1 var2
7 1 5 6
2 2 7 7
4 3 6 7
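A shorter base R variant of the same idea (shuffle the rows first, then keep the first occurrence of each user_ID) could be:
set.seed(1)
shuffled <- DF[sample(nrow(DF)), ]                      # put the rows in random order
DF_unique <- shuffled[!duplicated(shuffled$user_ID), ]  # keep the first (hence random) row per user_ID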
