Aggregate with trimmed means in R

I am trying to aggregate data like this in R:
df = data.frame(c("a","a","a","a","a","b","b","b","b","b","c","c","c"))
colnames(df) = "f"
set.seed(10)
df$e = rnorm(13,20,5)
f e
1 a 20.09373
2 a 19.07874
3 a 13.14335
4 a 17.00416
5 a 21.47273
6 b 21.94897
7 b 13.95962
8 b 18.18162
9 b 11.86664
10 b 18.71761
11 c 25.50890
12 c 23.77891
13 c 18.80883
I would like to aggregate this by the column f and take a trimmed mean of e for each unique value of f (i.e. produce 3 rows of data).
I tried:
df2=data.frame(0)
df2=aggregate(df$e, by = "f",mean(df$e, trim=0.1))
got the following error:
Error in match.fun(FUN) :
'mean(df$e, trim = 0.1)' is not a function, character or symbol
I tried a few searches online and came up empty. My actual data consist of around 30 values of e per f, so I am not concerned that trim=0.1 won't actually trim anything in this example (no points lie outside the upper and lower 5th percentiles); it will with the real data. This is just to get the aggregate function working as intended. Thanks!

Try this:
df2 = aggregate(e ~ f, data = df, mean, trim = 0.1)
f e
1 a 18.15854
2 b 16.93489
3 c 22.69888
The function to use for the calculation can be given just by its name, for example mean, and any additional arguments that function needs are passed after it; aggregate forwards them to the function through its ... argument.
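If you prefer not to use the formula interface, two equivalent ways to get the same grouped trimmed means are sketched below; tapply is base R, and the dplyr version assumes that package is installed:
# base R: named vector of trimmed means, one per level of f
tapply(df$e, df$f, mean, trim = 0.1)
# dplyr sketch
library(dplyr)
df %>%
  group_by(f) %>%
  summarise(e = mean(e, trim = 0.1))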

Related

Duplicate each row in a data frame a number of times equal to how many times a value in that row shows up in another data frame?

I apologize as I wasn't quite sure how to word my question without making it extremely lengthy, as the duplicate rows also need to have some altered values from the original.
I have two data frames. The first, df1, records all paths actually taken from source to destination, while the second, df2, contains all possible paths. Some sample data is below:
df1
Row Source Destination  Payload
  1      A           B 10010101
  2      A           D 11101011
  3      A           B 10111111
  4      E           B 01100110
df2
Row Source Destination
  1      A           B
  2      B           A
  3      B           C
  4      B           E
  5      B           F
  6      A           D
  7      D           A
  8      D           C
  9      D           H
For my data, it is assumed that if an object takes a path A -> B, for example, it also takes every possible path stemming from B that doesn't lead back to the original source (think of a network hub: in one port, out every other). So since we have a payload that goes from A -> B, I also need to record that same payload going from B to C, E, and F. I'm currently accomplishing this in the for loop below, but I would like to know if there is a better way to do it, preferably one that doesn't use looping. I'm also somewhat new to R, so even simple corrections to my code are appreciated.
for (row in 1:dim(df1)[1]) {
  initialSource <- df1$Source[row]  # saves the initial source
  paths <- df1[row, ]               # saves the current row for duplication
  # duplicates the row once per connection leaving the hub
  paths <- paths[rep(1, times = count(df2[df2$Source %in% df1$Destination[row], ])[[1]]), ]
  # replaces the source values with the location of the hub
  paths$Source <- paths$Destination
  # replaces the destination values with every connection from the hub
  paths$Destination <- df2$Destination[df2$Source %in% paths$Destination]
  # removes the rows that would send data back to the original source
  paths <- paths[!(paths$Destination %in% initialSource), ]
  # saves the new rows to a larger data frame that df1 is actually a sample of
  masterdf <- rbind(masterdf, paths)
}
With the above data, the data frame paths would look like this at the end of the first loop iteration:
Row Source Destination  Payload
  1      B           C 10010101
  2      B           E 10010101
  3      B           F 10010101
Maybe you could try merging your two data frames. With base R merge you could do the following, joining "Destination" from df1 to "Source" from df2. You then need to remove rows to exclude the "original source" as you described. Renaming and selecting the columns gives the final output. Please let me know if this is what you had in mind.
d <- subset(
  merge(df1, df2, by.x = "Destination", by.y = "Source", all = TRUE),
  Source != Destination.y
)
data.frame(
  Source = d$Destination,
  Destination = d$Destination.y,
  Payload = d$Payload
)
Output
Source Destination Payload
1 B C 10010101
2 B E 10010101
3 B F 10010101
4 B C 10111111
5 B E 10111111
6 B F 10111111
7 B C 1100110
8 B F 1100110
9 B A 1100110
10 D C 11101011
11 D H 11101011
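If you already use the tidyverse, roughly the same join can be written with dplyr. This is only a sketch under the column names Source/Destination/Payload shown above; the temporary names Hub and NextHop are made up to avoid the .x/.y suffixes:
library(dplyr)
df1 %>%
  inner_join(rename(df2, Hub = Source, NextHop = Destination),
             by = c("Destination" = "Hub")) %>%   # attach every connection leaving the hub
  filter(NextHop != Source) %>%                   # drop paths back to the original source
  transmute(Source = Destination,                 # the hub becomes the new source
            Destination = NextHop,
            Payload)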

How to transpose a long data frame every n rows

I have a data frame like this:
x = data.frame(type = c('a','b','c','a','b','a','b','c'),
               value = c(5, 2, 3, 2, 10, 6, 7, 8))
Every item has attributes a, b, and c, but some items may have missing records, i.e. only a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g, assuming each occurrence of a starts a new group; then use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternative definition of the grouping vector, slightly more complex but needed if some groups are missing a, is the following. It assumes that x lists the type values in order of their levels within each group, so that whenever a level is less than the prior level it must be the start of a new group.
g <- cumsum(c(-1, diff(as.numeric(x$type))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then whether the b and c in the second group actually form a second group or are part of the first group cannot be determined.
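For completeness, a tidyverse sketch of the same reshape (it assumes the dplyr and tidyr packages and, like the rowid approach above, treats the n-th occurrence of each type as belonging to item n):
library(dplyr)
library(tidyr)
x %>%
  group_by(type) %>%
  mutate(item = row_number()) %>%   # occurrence index of each type
  ungroup() %>%
  pivot_wider(names_from = type, values_from = value)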

R: Stack error - How to merge multiple columns into one very long column in R

This should be easy but I'm having a lot of difficulty.
I have a relatively large dataset of medications. What I want is a table of frequencies ranging over ALL the columns, so that I can see which medications appear most commonly across columns 1:8.
My idea was to combine all of these columns into one long column, one on top of the other. However, I have tried multiple functions (stack, melt, matrix), and they all give me bizarre results. The one that seems correct to use is stack, but it keeps returning the error message "Error in stack.data.frame(meds) : no vector columns were selected". I've seen this error on the message boards before; I tried converting with as.vector, but that is not working. The object is definitely of class data.frame.
If there is another way to achieve these table results, that would be great, but either way, it's not working right now. Could somebody help?
Consider do.call or Reduce with the c() function to combine all columns into one vector, and then count the unique meds using an sapply loop:
set.seed(79)
meds <- data.frame(MED1 = sample(LETTERS, 8),
                   MED2 = sample(LETTERS, 8),
                   MED3 = sample(LETTERS, 8),
                   MED4 = sample(LETTERS, 8),
                   MED5 = sample(LETTERS, 8),
                   MED6 = sample(LETTERS, 8),
                   MED7 = sample(LETTERS, 8),
                   MED8 = sample(LETTERS, 8),
                   stringsAsFactors = FALSE)
medslist <- do.call(c, meds) # OR Reduce(c, meds)
medslength <- sapply(unique(medslist), function(i) length(medslist[medslist==i]))
medslength <- sort(medslength, decreasing=TRUE)
medslength[1:8]
# B U W L I E M R
# 5 5 3 3 3 3 3 3
Try this to get what you want. No stacking necessary:
df = data.frame(Col1 = sample(LETTERS, 50, replace = T),
                Col2 = sample(LETTERS, 50, replace = T))
> table(as.matrix(df))
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 2 3 3 4 3 5 4 3 5 3 4 8 4 5 3 6 5 2 5 4 4 2 4 2 3 4
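Since the goal is the most common medications, the same table can simply be sorted; for example, with the meds data frame from the answer above:
# counts across all 8 columns, most frequent first
sort(table(as.matrix(meds)), decreasing = TRUE)[1:8]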

Function that group values of a list (in R)

I am trying to construct a function which shouldn't be hard in terms of programming, but I am having some difficulty conceptualizing it. I hope you'll be able to understand my problem better than I do!
I'd like a function that takes a single list of vectors as argument. Something like
arg1 = list(c(1,2), c(2,3), c(5,6), c(1,3), c(4,6), c(6,7), c(7,5), c(5,8))
The function should output a matrix with two columns (or a list of two vectors, or something like that), where one column contains letters and the other numbers. One can think of the argument as a list of the positions/values that should be placed in the same group. If the list contains the vector c(5,6), then the output should contain, somewhere in the number column, the values 5 and 6 next to the same letter. If it contains the three vectors c(1,2), c(2,3) and c(1,3), then the output should contain the values 1, 2 and 3 next to the same letter.
Therefore if we enter the object arg1 in the function it should return:
myFun(arg1)
number_column letters_column
1 A
2 A
3 A
5 B
6 B
7 B
4 C
6 C
5 D
8 D
(The order is not important, but the letter E should not appear before the letter D has been used.)
The function has therefore constructed 2 groups of 3 (A: [1,2,3] and B: [5,6,7]) and 2 groups of 2 (C: [4,6] and D: [5,8]). Note that one position/number can be in several groups.
Please let me know if something is unclear in my question! Thanks!
As I wrote in the comments, it appears that you want a data frame that lists the maximal cliques of a graph given a list of vectors that define the edges.
require(igraph)
## create a matrix where each row is an edge
argmatrix <- do.call(rbind, arg1)
## create an igraph object from the matrix of edges
gph <- graph.edgelist(argmatrix, directed = FALSE)
## returns a list of the maximal cliques of the graph
mxc <- maximal.cliques(gph)
## creates a data frame of the output
dat <- data.frame(number_column = unlist(mxc),
                  group_column = rep.int(seq_along(mxc), times = sapply(mxc, length)))
## converts group numbers to letters
## ONLY USE if max(dat$group_column) <= 26
dat$group_column <- LETTERS[dat$group_column]
# number_column group_column
# 1 5 A
# 2 8 A
# 3 5 B
# 4 6 B
# 5 7 B
# 6 4 C
# 7 6 C
# 8 3 D
# 9 1 D
# 10 2 D
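Newer igraph releases prefer the underscore-style function names; the equivalent calls (same logic, only the names differ) would be roughly:
gph <- graph_from_edgelist(argmatrix, directed = FALSE)
mxc <- max_cliques(gph)   # list of maximal cliques, as above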

How to ddply() without sorting?

I use the following code to summarize my data, grouped by Compound, Replicate and Mass.
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass),
                          .fun = calculate_T60_Over_T0_Ratio)
An unfortunate side effect is that the resulting data frame is sorted by those fields. I would like to do this and keep Compound, Replicate and Mass in the same order as in the original data frame. Any ideas? I tried adding a "Sorting" column of sequential integers to the original data, but of course I can't include that in the .variables since I don't want to 'group by' that, and so it is not returned in the summaryDataFrame.
Thanks for the help.
This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
  col <- ".sortColumn"
  data[, col] <- 1:nrow(data)
  out <- fn(data, ...)
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  out <- out[order(out[, col]), ]
  out[, col] <- NULL
  out
}
#Some sample data
d <- structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315,
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677,
0.872558638137971)), .Names = c("g", "v"), row.names = c(NA,
-6L), class = "data.frame")
#This one resorts
ddply(d, .(g), mutate, v=scale(v)) #does not preserve order of d
#This one does not
keeping.order(d, ddply, .(g), mutate, v=scale(v)) #preserves order of d
Please do read the thread for Hadley's notes about why this functionality may not be general enough to roll into ddply. The caveat probably applies in your case: since you are summarising, each piece returns fewer rows, so the sort column is not preserved and keeping.order will not work as-is.
Edited to include a strategy for more general cases
If ddply is outputting something that is sorted in an order you do not like you basically have two options: specify the desired ordering on the splitting variables beforehand using ordered factors, or manually sort the output after the fact.
For instance, consider the following data:
d <- data.frame(x1 = rep(letters[1:3], each = 5),
                x2 = rep(letters[4:6], 5),
                x3 = 1:15, stringsAsFactors = FALSE)
using strings, for now. ddply will sort the output, which in this case will entail the default lexical ordering:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
> ddply(d[sample(1:15,15),],.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
If the resulting data frame isn't ending up in the "right" order, it's probably because you really want some of those variables to be ordered factors. Suppose that we really wanted x1 and x2 ordered like so:
d$x1 <- factor(d$x1, levels = c('b','a','c'),ordered = TRUE)
d$x2 <- factor(d$x2, levels = c('d','f','e'), ordered = TRUE)
Now when we use ddply, the resulting sort will be as we intend:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 b d 17
2 b f 15
3 b e 8
4 a d 5
5 a f 3
6 a e 7
7 c d 13
8 c f 27
9 c e 25
The moral of the story here is that if ddply is outputting something in an order you didn't intend, it's a good sign that you should be using ordered factors for the variables you're splitting on.
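The "manually sort afterwards" route mentioned earlier can also be done without ordered factors, by matching the output back against the order in which each group first appears in the original data; a sketch using d from above (it works whether the columns are characters or factors):
out <- ddply(d, .(x1, x2), summarise, val = sum(x3))
# reorder rows to follow the first appearance of each (x1, x2) pair in d
orig_key <- unique(paste(d$x1, d$x2))
out[order(match(paste(out$x1, out$x2), orig_key)), ]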
I eventually ended up adding an 'indexing' column to the original data frame, made by pasting two of the columns together with sep = "_". Then I made another data frame consisting only of the unique values of that indexing column and a counter running from 1 to the number of unique values. I ran my ddply() on the data, which returned a sorted data frame. Then, to get things back in the original order, I merge()d the results data frame with the index data frame (naming the relevant columns the same thing makes this easier). Finally, I ordered by the counter and removed the extraneous columns.
Not an elegant solution, but one that works.
Thanks for the assist. It got me thinking in the right direction.
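For reference, a rough sketch of that workaround with the reviewDataFrame / Compound / Replicate / Mass names from the question; the helper names idx and origOrder are made up for illustration, and the key is built on the fly here rather than stored in the original frame:
# original order of appearance of each (Compound, Replicate, Mass) group
origKeys <- unique(paste(reviewDataFrame$Compound, reviewDataFrame$Replicate,
                         reviewDataFrame$Mass, sep = "_"))
indexDF  <- data.frame(idx = origKeys, origOrder = seq_along(origKeys))
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass),
                          .fun = calculate_T60_Over_T0_Ratio)
# rebuild the key on the summary, merge in the original order, sort, then drop the helpers
summaryDataFrame$idx <- paste(summaryDataFrame$Compound, summaryDataFrame$Replicate,
                              summaryDataFrame$Mass, sep = "_")
summaryDataFrame <- merge(summaryDataFrame, indexDF, by = "idx")
summaryDataFrame <- summaryDataFrame[order(summaryDataFrame$origOrder), ]
summaryDataFrame$idx <- NULL
summaryDataFrame$origOrder <- NULL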
