I use the following code to summarize my data, grouped by Compound, Replicate and Mass.
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass),
.fun = calculate_T60_Over_T0_Ratio)
An unfortunate side effect is that the resulting data frame is sorted by those fields. I would like to do the same summary but keep Compound, Replicate and Mass in the same order as in the original data frame. Any ideas? I tried adding a "Sorting" column of sequential integers to the original data, but of course I can't include it in .variables, since I don't want to group by it, and so it is not returned in the summaryDataFrame.
Thanks for the help.
This came up on the plyr mailing list a while back (raised by @kohske, no less), and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
  col <- ".sortColumn"
  data[, col] <- 1:nrow(data)
  out <- fn(data, ...)
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  out <- out[order(out[, col]), ]
  out[, col] <- NULL
  out
}
#Some sample data
d <- structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315,
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677,
0.872558638137971)), .Names = c("g", "v"), row.names = c(NA,
-6L), class = "data.frame")
#This one resorts
ddply(d, .(g), mutate, v=scale(v)) #does not preserve order of d
#This one does not
keeping.order(d, ddply, .(g), mutate, v=scale(v)) #preserves order of d
Please do read the thread for Hadley's notes on why this functionality may not be general enough to roll into ddply; the caveat probably applies in your case, since you are likely returning fewer rows for each piece.
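To see that limitation concretely, here is a quick sketch using the sample data d above: a summarising function returns only the grouping variable and the new column, so the helper column is dropped and keeping.order stops with its own error.
#Sketch of the limitation (not from the original thread): summarise drops
# the ".sortColumn" helper, so keeping.order refuses to reorder
try(keeping.order(d, ddply, .(g), summarise, v = mean(v)))
# Error ... : Ordering column not preserved by function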
Edited to include a strategy for more general cases
If ddply outputs something sorted in an order you do not like, you basically have two options: specify the desired ordering on the splitting variables beforehand using ordered factors, or manually sort the output after the fact.
For instance, consider the following data:
d <- data.frame(x1 = rep(letters[1:3],each = 5),
x2 = rep(letters[4:6],5),
x3 = 1:15,stringsAsFactors = FALSE)
using strings, for now. ddply will sort the output, which in this case will entail the default lexical ordering:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
Even if we shuffle the rows of d first, the output comes back in the same sorted order:
> ddply(d[sample(1:15,15),],.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 a d 5
2 a e 7
3 a f 3
4 b d 17
5 b e 8
6 b f 15
7 c d 13
8 c e 25
9 c f 27
If the resulting data frame isn't ending up in the "right" order, it's probably because you really want some of those variables to be ordered factors. Suppose that we really wanted x1 and x2 ordered like so:
d$x1 <- factor(d$x1, levels = c('b','a','c'),ordered = TRUE)
d$x2 <- factor(d$x2, levels = c('d','f','e'), ordered = TRUE)
Now when we use ddply, the resulting sort will be as we intend:
> ddply(d,.(x1,x2),summarise, val = sum(x3))
x1 x2 val
1 b d 17
2 b f 15
3 b e 8
4 a d 5
5 a f 3
6 a e 7
7 c d 13
8 c f 27
9 c e 25
The moral of the story here is that if ddply is outputting something in an order you didn't intend, it's a good sign that you should be using ordered factors for the variables you're splitting on.
I eventually ended up adding an 'indexing' column to the original data frame, made by pasting two columns together with sep = "_". Then I made a second data frame containing only the unique values of that indexing column plus a counter running from 1 to the number of unique values. I ran my ddply() on the data, which returned a sorted data frame, and then merge()d the results with the index data frame (making sure the columns are named the same thing in both makes this easier). Finally, I ordered by the counter and removed the extraneous columns, roughly as sketched below.
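A minimal sketch of that workaround, written against the ddply call from the question; which two columns get pasted into the index key is my assumption (Compound and Replicate here), not something stated above.
#Sketch of the indexing-and-merge workaround; assumes the data and the
# calculate_T60_Over_T0_Ratio function from the question
key <- paste(reviewDataFrame$Compound, reviewDataFrame$Replicate, sep = "_")
indexDF <- data.frame(idx = unique(key), ord = seq_along(unique(key)))
summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass),
                          .fun = calculate_T60_Over_T0_Ratio)
#Rebuild the key on the summary (the grouping columns are returned), merge,
# restore the original group order, then drop the helper columns
summaryDataFrame$idx <- paste(summaryDataFrame$Compound, summaryDataFrame$Replicate, sep = "_")
summaryDataFrame <- merge(summaryDataFrame, indexDF, by = "idx")
summaryDataFrame <- summaryDataFrame[order(summaryDataFrame$ord), ]
summaryDataFrame$idx <- NULL
summaryDataFrame$ord <- NULL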
Not an elegant solution, but one that works.
Thanks for the assist. It got me thinking in the right direction.
I would like to create a new column which equals the mean of several variables (columns) in my data frame. However, I'm afraid I can't use 'rowMeans' because I don't want to average all the variables. Moreover, I'm hesitant to manually type all the variable names (there are many). For example:
my_data <- data.frame(a = c(1,2,3), b = c(4,5,6), c = c(10,10,10), d = c(13,24,81),
e = c(10, 8, 6), hello = c(1,-1,1), bye = c(1,5,5))
I want to create a new column called avg which is the average of variables a, b, c, d, and e only. Because in my dataset the variable names are long (and complex), and there are more than 10 variables, I prefer not to type them out one by one. So I guess I might need to use the dplyr package and the mutate function? Could you please suggest a clever way for me to do that?
The content below was added following your kind comments and answers. Thank you all again:
Actually, the column names that I need are Mcheck5_1_1, Mcheck5_2_1, ..., Mcheck5_8_1 (so there are 8 in total). However, I tried
my_data$avg = rowMeans(select(my_data, Mcheck5_1_1:Mcheck5_8_1), na.rm = TRUE)
but an error was thrown to me:
Error in select(my_data, Mcheck5_1_1:Mcheck5_8_1) :
unused argument (Mcheck5_1_1:Mcheck5_8_1)
Right now I have solved the problem with the following code:
idx = grep("Mcheck5_1_1", names(my_data))
my_data$avg = rowMeans(my_data[, idx:(idx + 7)], na.rm = TRUE)
But is there a more elegant way to do it? Or why couldn't I use select()? Thanks!
I would do something like this
my_data <- data.frame(a = c(1,2,3), b = c(4,5,6), c = c(10,10,10), d = c(13,24,81),
e = c(10, 8, 6), hello = c(1,-1,1), bye = c(1,5,5))
several_variables <- c('a', 'b', 'c', 'd', 'e') # or `letters[1:5]`
my_data$avg <- rowMeans(my_data[,several_variables])
my_data
#> a b c d e hello bye avg
#> 1 1 4 10 13 10 1 1 7.6
#> 2 2 5 10 24 8 -1 5 9.8
#> 3 3 6 10 81 6 1 5 21.2
Obviously, if the variables are at some fixed positions, and you know they will stay there, you could use numbered indexing as suggested by Jaap:
my_data$avg <- rowMeans(my_data[,1:5])
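If the positions are not fixed, here is a sketch of selecting by a name pattern instead; "^Mcheck5_" would be the pattern for the question's own column names, while the toy data here uses single letters:
#Select the target columns by name pattern with base R's grep
cols <- grep("^[a-e]$", names(my_data), value = TRUE) # use "^Mcheck5_" for the real data
my_data$avg <- rowMeans(my_data[, cols], na.rm = TRUE)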
This should be easy but I'm having a lot of difficulty.
I have a relatively large dataset of medications.
What I want is a table of frequencies, but ranging over ALL the columns, so that I can see which medication appears most commonly across columns 1:8.
My idea was to combine all of these columns into one long column, just one on top of the other. However, I have tried multiple functions (stack, melt, matrix), but they all give me bizarre results. The one that seems right for me to use is stack, but it keeps returning the error message "Error in stack.data.frame(meds) : no vector columns were selected". I've seen this error on the message boards before; I tried converting with as.vector, but that is not working either. The object is definitely of class data.frame.
If there is another way to achieve these table results, that would be great, but either way, it's not working right now. Could somebody help?
Consider do.call() or Reduce() with the c() function to combine all columns into one vector, and then count the unique meds with an sapply() loop:
set.seed(79)
meds <- data.frame(MED1=sample(LETTERS, 8),
MED2=sample(LETTERS, 8),
MED3=sample(LETTERS, 8),
MED4=sample(LETTERS, 8),
MED5=sample(LETTERS, 8),
MED6=sample(LETTERS, 8),
MED7=sample(LETTERS, 8),
MED8=sample(LETTERS, 8), stringsAsFactors = FALSE)
medslist <- do.call(c, meds) # OR Reduce(c, meds)
medslength <- sapply(unique(medslist), function(i) length(medslist[medslist==i]))
medslength <- sort(medslength, decreasing=TRUE)
medslength[1:8]
# B U W L I E M R
# 5 5 3 3 3 3 3 3
Try this to get what you want. No stacking necessary:
df = data.frame(Col1 = sample(LETTERS,50,replace=T),
Col2 = sample(LETTERS,50,replace=T))
> table(as.matrix(df))
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 2 3 3 4 3 5 4 3 5 3 4 8 4 5 3 6 5 2 5 4 4 2 4 2 3 4
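Applied to the eight-column meds example from the previous answer, sorting that table puts the most common medications first (a small sketch reusing that data):
#Count every cell across all eight MED columns and show the top entries
sort(table(as.matrix(meds)), decreasing = TRUE)[1:8]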
I just started working with R for my master's thesis, and up to now all my calculations have worked out, as I read a lot of questions and answers here (and it's a lot of trial and error, but that's OK).
Now I need to write a more sophisticated piece of code and I can't find a way to do it.
That's the situation: I have multiple sub-datasets with a lot of entries, but they are all structured in the same way. In one of them (50,000 entries) I want to change only one value in every row. The new value should be the existing entry plus a few values from another sub-dataset (140,000 entries) where the 'ID' variable is the same.
As this is the third day I've been trying to solve this, I have already found and tested for and apply, but both ran for hours (cancelled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
  Entry_ID <- Sub02[i, 4]
  SUM_Entries <- sum(Sub03$Source == Entry_ID)
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
  Value1 <- as.numeric(Entries_w_ID$VAL1)
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)
  OLD_Val1 <- Sub02[i, 13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
  Sub02[i, 13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
The values column is character, so we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
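Applied to the question's data, a hedged sketch along the same lines; Sub02, Sub03 and the columns Source, ID, VAL1, VAL2 and VAL9 are taken from the example above, and this is untested against the real data:
library(dplyr)
#Sum the relevant values in Sub03 once per Source instead of once per row
sums <- Sub03 %>%
  group_by(Source) %>%
  summarise(n_entries = n(),
            sum_val1 = sum(as.numeric(VAL1)),
            sum_val2 = sum(as.numeric(VAL2)))
#Join onto Sub02 (matching Sub02$ID to Sub03$Source) and update VAL9;
# IDs with no match in Sub03 get NA here and may need extra handling
Sub02 <- Sub02 %>%
  left_join(sums, by = c("ID" = "Source")) %>%
  mutate(VAL9 = as.numeric(VAL9) + n_entries + sum_val1 + sum_val2) %>%
  select(-n_entries, -sum_val1, -sum_val2)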
(Updated.)
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g. df$new <- df$old1 + df$old2.
I would like to have an equivalent of the Excel function IF. It seems basic enough, but I could not find relevant help.
I would like to assign NA to specific cells if two consecutive cells in a different column are not identical. In Excel, the formula (say in C1) would be =IF(A1=A2, B1, "NA"); I would then just fill it down the rest of the column.
But in R, I am stuck!
Here is an equivalent of my R code so far.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"))
df
To get the following Type of each Type in another column, I found a useful function on StackOverflow that does the job.
# determines the following Type of each Type
shift <- function(x, n){
  c(x[-(seq(n))], rep(6, n))
}
df$TypeFoll <- shift(df$Type, 1)
df
Now, I would like to keep TypeFoll in a specific row when the File for this row is identical to the File on the next row.
Here is what I tried. It failed!
for(i in 1:length(df$File)){
  df$TypeFoll2 <- ifelse(df$File[i] == df$File[i+1], df$TypeFoll, "NA")
}
df
In the end, my data frame should look like:
aim = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"),
TypeFoll = c("2","3","4","4","5","6"),
TypeFoll2 = c("2","NA","4","4","NA","6"))
aim
Oh, and by the way, if someone knows how to easily put the columns TypeFoll and TypeFoll2 just after the column Type, that would be great!
Thanks in advance
I would do it as follows (not keeping the result from the shift function)
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"), stringsAsFactors = FALSE)
# This is your shift function
len=nrow(df)
A1 <- df$File[1:(len-1)]
A2 <- df$File[2:len]
# Why do you save the result of the shift function in the df?
Then compute the equivalent of IF(A1=A2, B1, "NA"). As akrun mentioned, ifelse is vectorised. By the way, this is how you append a column to a data.frame (note that the "yes" value needs to be the following Type, i.e. df$Type[2:len], to match the expected output):
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), 6) #Why 6?
As 6 is hardcoded here, something like the following is more generic:
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), max(as.numeric(df$Type)) + 1)
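For reference, with the example data this gives the same TypeFoll2 values as the aim data frame in the question, with a real NA instead of the string "NA":
df$TypeFoll2
# [1] "2" NA  "4" "4" NA  "6"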
First off, 'for' loops are pretty slow in R, so try to think of this as vector manipulation instead.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"));
Create shifted types and files and put it in new columns:
df$TypeFoll = c(as.character(df$Type[2:nrow(df)]), "NA");
df$FileFoll = c(as.character(df$File[2:nrow(df)]), "NA");
Now, df looks like this:
> df
Type File TypeFoll FileFoll
1 1 A 2 A
2 2 A 3 B
3 3 B 4 B
4 4 B 4 B
5 4 B 5 C
6 5 C NA NA
Then, create TypeFoll2 by combining these:
df$TypeFoll2 = ifelse(df$File == df$FileFoll, df$TypeFoll, "NA");
And you should have something that looks a lot like what you want:
> df;
Type File TypeFoll FileFoll TypeFoll2
1 1 A 2 A 2
2 2 A 3 B NA
3 3 B 4 B 4
4 4 B 4 B 4
5 4 B 5 C NA
6 5 C NA NA NA
If you want to remove the FileFoll column:
df$FileFoll = NULL;
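On the side question about column order, a small sketch that moves TypeFoll and TypeFoll2 right after Type (column names as in the example; FileFoll has already been dropped):
df <- df[, c("Type", "TypeFoll", "TypeFoll2", "File")];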
I'm stuck on a really easy task and hope someone can help me with it.
I'd like to make a new (sub)vector from an existing vector, based on one level of a factor.
Example:
v = c(1,2,3,4,5,6,7,8,9,10)
f = factor(rep(c("Drug","Placebo"),5))
I want to make new vectors from v, containing only the "Drug" or only the "Placebo" values, resulting in:
vDrug = 1,3,5,7,9
vPlacebo = 2,4,6,8,10
Thanks in advance!
You can easily subset v by f:
v[ f == "Drug" ]
[1] 1 3 5 7 9
However, this approach might become error-prone in a more complex environment or with larger data sets. Accordingly, it would be better to store v and f together in a data.frame and then perform all kinds of queries and transformations on this data.frame:
mdf <- data.frame( v = c(1,2,3,4,5,6,7,8,9,10), f = factor(rep(c("Drug","Placebo"),5)) )
mdf
v f
1 1 Drug
2 2 Placebo
3 3 Drug
4 4 Placebo
...
If you want to look at your data interactively, you can subset using the subset function:
subset( mdf, f == "Drug", select=v )
If you are doing this programmatically, you should rather use
mdf[ mdf$f == "Drug", "v" ]
For the difference between the two, have a look at: Why is `[` better than `subset`?
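Finally, to get both vDrug and vPlacebo from the question in one step, base R's split works as well (a small sketch):
groups <- split(v, f)
vDrug <- groups$Drug # 1 3 5 7 9
vPlacebo <- groups$Placebo # 2 4 6 8 10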