Unsplit reduced data table based on two factors in R

Suppose I have a data frame in R in which I would like to use two columns, "factor1" and "factor2", as factors, and I need to calculate the mean of all other columns for each pair of those factors. After running the code below, the last line gives the following warnings:
Warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Why is it happening and what should I do to make it right?
Thanks.
Here is my code:
# Create data frame
myDataFrame <- data.frame(factor1=c(1,1,1,2,2,2,3,3,3), factor2=c(3,3,3,4,4,4,5,5,5), val1=c(1,2,3,4,5,6,7,8,9), val2=c(9,8,7,6,5,4,3,2,1))
# Split by 2 columns (factors)
splitDataFrame <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2))
# Calculate mean value for each column per each pair of factors
splitMeanValues <- lapply(splitDataFrame, function(x) apply(x, 2, mean))
# Combine back into a reduced table with only one value (mean) per pair of factors
MeanValues <- unsplit(splitMeanValues, list(unique(myDataFrame$factor1), unique(myDataFrame$factor2)))
EDIT1: Added data frame creation (see above)

If you need to calculate the mean of all columns other than the factors, you can use the formula syntax of aggregate():
aggregate(.~factor1+factor2, myDataFrame, FUN=mean)
That returns
factor1 factor2 val1 val2
1 1 3 2 8
2 2 4 5 5
3 3 5 8 2
Your split() method didn't work because unsplit() expects the same number of rows that the data had when it was split, and you were reducing every group to a single row. In addition, unsplit() should be called with exactly the same list of factors that was used for the split; otherwise groups may get out of order. If you really wanted to, you could do a split, then lapply some collapsing function, and then rbind the list back into a single data.frame, but for a simple mean aggregate() is probably best.
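For completeness, the split / lapply / rbind route described in this answer might look like the minimal sketch below. It assumes the myDataFrame from the question; the names groups, groupMeans and meanValues are only illustrative, and drop = TRUE is an added choice so that factor pairs that never occur do not produce empty groups.

```r
# Same data frame as in the question
myDataFrame <- data.frame(factor1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                          factor2 = c(3, 3, 3, 4, 4, 4, 5, 5, 5),
                          val1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                          val2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))

# Split by both factors; drop = TRUE discards pairs that have no rows
groups <- split(myDataFrame,
                list(myDataFrame$factor1, myDataFrame$factor2),
                drop = TRUE)

# Collapse each group to a one-row data frame of column means
groupMeans <- lapply(groups, function(g) as.data.frame(as.list(colMeans(g))))

# Stack the one-row data frames into a single reduced table
meanValues <- do.call(rbind, groupMeans)
rownames(meanValues) <- NULL
meanValues
```

Because every group is collapsed before the pieces are stacked, nothing has to be mapped back onto the original row positions, which is what unsplit() tries to do.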

The same result can be obtained with summaryBy() in the doBy package, although it is pretty much the same as aggregate() in this case.
> library(doBy)
> summaryBy( . ~ factor1+factor2, data = myDataFrame)
# factor1 factor2 val1.mean val2.mean
# 1 1 3 2 8
# 2 2 4 5 5
# 3 3 5 8 2

Have you tried aggregate? Note that the by argument must be a list:
aggregate(myDataFrame$val1, by = list(factor1 = myDataFrame$factor1), FUN = mean)
aggregate(myDataFrame$val1, by = list(factor2 = myDataFrame$factor2), FUN = mean)

Related

How do I make the list output from the 'by' function in R usable?

I have a set of data with a dependent variable and two factors. I would like to randomly sample the dependent variable (with replacement) within each combination of my two factors, and the number of random samples retrieved should equal the number of observations that originally existed at that combination. I've been able to do this using the 'by' function. The problem is that the output is a list, and I'd like something more accessible, but I haven't had any luck converting it to a data frame. My end goal is to run the simulation described above 1000 times and, for each simulation, calculate the average of the random samples retrieved for each combination of the factors.
This produces the dataset:
value<-runif(100,5,25)
cat1<-factor(rep(1:10,10))
a<-rep("A",50)
b<-rep("B",50)
cat2<-append(a,b)
data<-as.data.frame(cbind(value,cat1,cat2))
This creates one simulation of random values drawn from the factor levels and
stores that info in a list:
list<-by(data[,"value"],data[,c("cat1","cat2")],function(x) sample(x,length(x),T))
What I'd like to do is wind up with a data frame that has the columns "Simulation", "AverageValue", "cat1", and "cat2", so that I would have 1000 simulation lines for each combination of cat1 and cat2.
Any suggestions on how to make the 'by' output more accessible so I can run a for loop on the output or other suggestions would be great.
Thanks!

As a more general method, you might like to use dplyr rather than by. This way you'll keep your data.frame.
In this case, you would use group_by (rather than by) to group by your cat1 and cat2, and use mutate to add a new column. You could replace new = with value = if you don't want to keep your old data:
library(dplyr)
data %>% group_by(cat1, cat2) %>%
mutate(new = sample(value, length(value), replace = T))
Source: local data frame [100 x 4]
Groups: cat1, cat2 [20]
value cat1 cat2 new
(fctr) (fctr) (fctr) (fctr)
1 13.9639607304707 1 A 13.2139691384509
2 22.6068278681487 2 A 5.27278678957373
3 24.6930849226192 3 A 22.0293137291446
4 16.842244095169 4 A 9.56347029190511
5 18.467006101273 5 A 23.1605510273948
6 20.6661582039669 6 A 24.3043746100739
7 9.37060782220215 7 A 13.9268753770739
8 6.68592340312898 8 A 20.034239795059
9 6.95704637560993 9 A 12.676755907014
10 17.2769332909957 10 A 24.453850784339
.. ... ... ...
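To get from one resampled column to the table of 1000 simulations asked for in the question, one possible approach is sketched below. It swaps in base R's aggregate() instead of dplyr, and it builds the data with data.frame() so that value stays numeric (the cbind() call in the question coerces every column to character, which is why the output above shows value as a factor). The names dat, one_simulation, n_sim and simulations, and the seed, are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Rebuild the example data, keeping value numeric
value <- runif(100, 5, 25)
cat1 <- factor(rep(1:10, 10))
cat2 <- factor(rep(c("A", "B"), each = 50))
dat <- data.frame(value, cat1, cat2)

# One simulation: resample value with replacement within each
# cat1/cat2 combination and take the mean of the resample
one_simulation <- function(i) {
  res <- aggregate(value ~ cat1 + cat2, data = dat,
                   FUN = function(x) mean(sample(x, length(x), replace = TRUE)))
  names(res)[names(res) == "value"] <- "AverageValue"
  cbind(Simulation = i, res)
}

# Repeat the simulation and stack the results into one data frame
n_sim <- 1000
simulations <- do.call(rbind, lapply(seq_len(n_sim), one_simulation))
head(simulations)
```

The result has one row per simulation and factor combination, with the columns Simulation, cat1, cat2 and AverageValue. Note that sample(x, ...) treats a single number x as 1:x, so this sketch relies on every combination holding more than one observation, as it does here.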

How to select specific elements and find their index in a data.frame?

I would like to select specific elements of data.list after processing it.
I describe my problem with the reproducible example below.
In the example code, data.list holds three data sets, each with 5 columns.
Each data set repeats itself three times, and each one is assigned a unique number called set_nbr that identifies it.
# Create reproducible data: three data sets, each repeating its Mx, My and Mz values 3 times, along with set_nbr
set.seed(1)
data.list <- lapply(1:3, function(x) {
nrep <- 3
time <- rep(seq(90,54000,length.out=600),times=nrep)
Mx <- c(replicate(nrep,sort(runif(600,-0.014,0.012),decreasing=TRUE)))
My <- c(replicate(nrep,sort(runif(600,-0.02,0.02),decreasing=TRUE)))
Mz <- c(replicate(nrep,sort(runif(600,-1,1),decreasing=TRUE)))
df <- data.frame(time,Mx,My,Mz,set_nbr=x)
})
After applying some function I have output like this:
result
time Mz set_nbr
1 27810 -1.917835e-03 1
2 28980 -1.344288e-03 1
3 28350 -3.426615e-05 1
4 27900 -9.934413e-04 1
5 25560 -1.016492e-02 2
6 27360 -4.790767e-03 2
7 28080 -7.062256e-04 2
8 26550 -1.171716e-04 2
9 26820 -2.495893e-03 3
10 26550 -7.397865e-03 3
11 26550 -2.574022e-03 3
12 27990 -1.575412e-02 3
My questions start here.
1) How do I get the min, middle and max values of the time column for each set_nbr?
2) How do I use the evaluated set_nbr and Mz values inside data.list?
In short:
After determining the min, middle and max values of the time column and the corresponding Mz values for each set_nbr in result, I want to go back to the original data.list and extract the Mx, My and Mz columns according to those set_nbr and Mz values. Since each set_nbr actually corresponds to 600 rows, I would like to extract the whole family of rows for those set_nbr values from data.list.
We use time as a factor to select set_nbr. Here "factor" means an extraction parameter, not a real factor in R.
In addition, as you will see, there are four rows for each set_nbr in result, but they actually refer to different data sets in data.list.

I'm a big advocate of using lists of data frames when appropriate, but in this case it doesn't look like there's any reason to keep them separated as different list items. Let's combine them into a single data frame.
library(dplyr)
dat = bind_rows(data.list)
Then getting your summary stats is easy:
dat %>% group_by(set_nbr) %>%
summarize(min_time = min(time),
max_time = max(time),
middle_time = median(time))
# Source: local data frame [3 x 4]
#
# set_nbr min_time max_time middle_time
# 1 1 90 54000 27045
# 2 2 90 54000 27045
# 3 3 90 54000 27045
In your sample data, time is defined the same way each time, so of course the min, median, and max are all the same.
I'd suggest, in the new question you ask about plotting, starting with the combined data frame dat.
As to your second question:
2) How to select evaluated set_nbr values inside of data.list?
To select a single item from a list, use double brackets:
data.list[[2]]
However, with the combined data, it's just a normal column of a normal data frame so any of these will work:
dat[dat$set_nbr == 2, ]
subset(dat, set_nbr == 2)
filter(dat, set_nbr == 2)
To your clarification in the comments: if you want the Mx and My values for the time and set_nbr in the result object, using my combined dat above, simply do a join: left_join(result, dat).
This should work, but I'm a little confused because in your simulated data time is numeric, but in your new text you say "we use time as a factor". If you've converted time to a factor object, this will only work if it has the same levels in each of the data frames in your data list. If not, I would recommend keeping time as numeric.
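The join can be sketched on a small toy example. Base R's merge() with all.x = TRUE is used here in place of dplyr's left_join(), and the two data frames below are hypothetical stand-ins (invented values, numeric time) for the combined dat and the processed result from the question.

```r
# Toy stand-in for the combined data frame (hypothetical values)
dat <- data.frame(time    = rep(c(90, 180, 270), times = 2),
                  Mx      = c(0.3, 0.2, 0.1, 0.6, 0.5, 0.4),
                  My      = c(1.3, 1.2, 1.1, 1.6, 1.5, 1.4),
                  set_nbr = rep(c(1, 2), each = 3))

# Toy stand-in for the processed result: one selected time per set
result <- data.frame(time = c(180, 270), set_nbr = c(1, 2))

# Keep every row of result and attach the matching Mx and My values
joined <- merge(result, dat, by = c("time", "set_nbr"), all.x = TRUE)
joined
```

In the question's real data each time value occurs three times within a set_nbr (once per repetition), so the same join would return three matching rows for every row of result.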

aggregating counts per category

I have a dataset (df) where I would just like to get some summary stats for the entire column variables and then a summary for the variables of 2 specific treatments. So far so good:
summary(var1)
aggregate(var1 ~ treatment, results, summary)
I then have one variable whose values are 1 and 2. I can count these with the sum function:
sum(var3 == 1)
sum(var3 == 2)
However, when I try to sum these by treatment:
aggregate(var3 ~ treatment, results, sum var3 == 1)
I get the following error:
Error in sum == 1 :
comparison (1) is possible only for atomic and list types
I have tried lots of variations on the same theme and taken a look through the textbooks I am using to help me with my first forays into R... but I can't seem to find the answer.

Here's a sample dataset (it's always best to include sample data to make your question reproducible).
set.seed(15)
results<-data.frame(
var1=runif(30),
var3=sample(1:2, 30, replace=T),
treatment=gl(2,15)
)
If you really want to use aggregate, you can do
aggregate(var3==1~treatment, results, sum)
# treatment var3 == 1
# 1 1 9
# 2 2 5
but since you're counting discrete observations, table() may be a better choice to do all the counting at once
with(results, table(var3, treatment))
# treatment
# var3 1 2
# 1 9 5
# 2 6 10
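The attempt in the question fails because the FUN argument of aggregate() expects a function, not an expression such as sum == 1. If you would rather keep var3 on the left-hand side of the formula, a minimal sketch is to pass an anonymous function that does the counting. The names counts and n_ones are illustrative only.

```r
# Same sample data as above
set.seed(15)
results <- data.frame(
  var1 = runif(30),
  var3 = sample(1:2, 30, replace = TRUE),
  treatment = gl(2, 15)
)

# FUN receives the var3 values of one treatment at a time
counts <- aggregate(var3 ~ treatment, data = results,
                    FUN = function(x) sum(x == 1))
names(counts)[2] <- "n_ones"
counts
```

The n_ones column should match the var3 == 1 row of the table() output for the same data.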

R: summing values of matched names and adding on new names' values

I am trying to do a simple task and have created a simple example. I would like to add the counts of a taxon recorded in one vector ('introduced', below) to the counts already measured in another vector ('existing'), according to the taxon name. However, when there is a new taxon (present in introduced but not in existing), I would like this taxon and its count to be added as a new entry in the matrix (the order doesn't matter, but the name needs to be retained).
For example:
existing<-c(3,4,5,6)
names(existing)<-c("Tax1","Tax2","Tax3","Tax4")
introduced<-c(2,2)
names(introduced)<-c("Tax1","Tax5")
I want the new matrix, called "combined" here, to look like this:
#names(combined)= c("Tax1","Tax2","Tax3","Tax4","Tax5")
#combined= c(5,4,5,6,2)
The main thing to see is that the values for "Tax1" are combined (3+2=5) and "Tax5" (2) is added on to the end.
I have looked around but previous answers similar to this have much more complex data and it is difficult to extract which function I need. I have been trying combinations of match and which, but just cannot get it right.

grp <- c(existing,introduced)
tapply(grp,names(grp),sum)
#Tax1 Tax2 Tax3 Tax4 Tax5
# 5 4 5 6 2
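For comparison, a more explicit version along the lines the question was attempting (matching on names) can be sketched with intersect() and setdiff(). This is only an alternative under the same example data; the helper names shared and added are illustrative.

```r
existing <- c(Tax1 = 3, Tax2 = 4, Tax3 = 5, Tax4 = 6)
introduced <- c(Tax1 = 2, Tax5 = 2)

# Taxa present in both vectors, and taxa that are new
shared <- intersect(names(introduced), names(existing))
added  <- setdiff(names(introduced), names(existing))

# Add the counts of shared taxa, then append the new taxa with their names
combined <- existing
combined[shared] <- combined[shared] + introduced[shared]
combined <- c(combined, introduced[added])
combined
```

Unlike tapply(), which sorts the result by name, this keeps the order of existing and appends new taxa at the end.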

Instead of keeping your data in 'loose' vectors, you may consider collecting them in one data frame. First, put your two sets of vector data in data frames:
existing <- c(3, 4, 5, 6)
taxon <- c("Tax1", "Tax2", "Tax3", "Tax4")
df1 <- data.frame(existing, taxon)
introduced <- c(2, 2)
taxon <- c("Tax1", "Tax5")
df2 <- data.frame(introduced, taxon)
Then merge the two data frames by the common column, 'taxon'. Set all = TRUE to include all rows from both data frames:
df3 <- merge(df1, df2, all = TRUE)
Finally, sum the 'existing' and 'introduced' counts for each taxon, and add the result to the data frame:
df3$combined <- rowSums(df3[ , c("existing", "introduced")], na.rm = TRUE)
df3
# taxon existing introduced combined
# 1 Tax1 3 2 5
# 2 Tax2 4 NA 4
# 3 Tax3 5 NA 5
# 4 Tax4 6 NA 6
# 5 Tax5 NA 2 2

Read multidimensional group data in R

I have done a lot of googling but I didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group id (the data have 3 groups: A, B, C), while the other columns are values.
I want to read this file in R so that I can apply different functions on the data.
For example I tried to read the file and tried to get column mean
dt<-read.table(file_name,head=T) #gives warnings
apply(dt,2,mean) #gives NA NA NA
I want to read this file and get the column means. Then I want to separate the data into 3 groups (according to Tag A, B, C) and calculate the column-wise mean for each group. Any help would be appreciated.

apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt[-1], mean) # works because data.frames are lists; dt[-1] drops the non-numeric Tag column
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
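As a self-contained illustration on the data from the question, the sketch below reads the table from an inline string (read.table's text argument stands in for the file, which is an assumption made so the example runs anywhere) and computes the group means with aggregate(), another base alternative to the two approaches above.

```r
# Read the example data; text = replaces the file name for this sketch
dt <- read.table(text = "Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1", header = TRUE)

# Overall column means, skipping the non-numeric Tag column
colMeans(dt[, -1])

# Column means within each Tag group
grpMeans <- aggregate(. ~ Tag, data = dt, FUN = mean)
grpMeans
```

aggregate() returns a data frame with one row per Tag, whereas the split/sapply version above returns a matrix with the tags as row names.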
