Function to store data.frames and calculate mean? - r

I'm trying to come up with a function and got stuck. I need to run a function (ses.mpd) 1000 times with randomized matrices. The outputs (data.frames) should be stored and then a data.frame with means of those 1000 output data.frames should be calculated.
Example:
output data.frames:
        ntaxa mpd.obs mpd.rand.mean mpd.rand.sd
sample1     3      10             9         0.2
sample2     6      15            12         0.6
sample3     4       9            10         0.1
        ntaxa mpd.obs mpd.rand.mean mpd.rand.sd
sample1     6      12            10         0.5
sample2     4      12            15         0.3
sample3     7       4             7         0.3
The result data.frame should look like this:
        ntaxa mpd.obs mpd.rand.mean mpd.rand.sd
sample1   4.5    11.0           9.5        0.35
sample2   5.0    13.5          13.5        0.45
sample3   5.5     6.5           8.5        0.20
I think I have to save the 1000 data.frames in a list and then maybe use the ddply function in plyr, but I don't really know how to do this.

If all the matrices are the same (e.g. same dimensions and same variable locations), then I would store them in a 3d array and use apply or rowMeans, etc. The latter will be faster.
Using a built-in dataset:
> dim(UCBAdmissions)
[1] 2 2 6
> rowMeans( UCBAdmissions, dims=c(2) )
          Gender
Admit          Male    Female
  Admitted 199.6667  92.83333
  Rejected 248.8333 213.00000
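Applied to the question's setup, a minimal sketch might look like the following. It assumes every output is an all-numeric data frame with identical row and column order; df1, df2 and results are illustrative names built from the example tables in the question, not real ses.mpd output.

```r
# Two illustrative outputs, copied from the example tables in the question
df1 <- data.frame(ntaxa = c(3, 6, 4), mpd.obs = c(10, 15, 9),
                  mpd.rand.mean = c(9, 12, 10), mpd.rand.sd = c(0.2, 0.6, 0.1),
                  row.names = c("sample1", "sample2", "sample3"))
df2 <- data.frame(ntaxa = c(6, 4, 7), mpd.obs = c(12, 12, 4),
                  mpd.rand.mean = c(10, 15, 7), mpd.rand.sd = c(0.5, 0.3, 0.3),
                  row.names = c("sample1", "sample2", "sample3"))
results <- list(df1, df2)  # in practice this list would hold all 1000 outputs

# Stack the numeric matrices into a rows x columns x runs array
arr <- simplify2array(lapply(results, as.matrix))

# Average over the third (run) dimension, keeping rows and columns
means <- as.data.frame(rowMeans(arr, dims = 2))
print(means)
```

For the two example tables this reproduces the result data.frame shown in the question. With real data, the list could be filled in a loop or with lapply before the last two steps are applied unchanged.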

Related

NAs introduced by coercion - mixed vector

NAs introduced by coercion. How to get around this? Thank you for your help.
water <- 785.5
volume_water <- as.numeric(as.character(c("water", water)))
volume_water
[1] NA 785.5
This is a data frame called data:
Substance v1
1 abc 12.5
2 defg 100.0
3 hijk 100.0
4 abfg 2.0
I want to achieve:
rbind(data, volume_water)
Substance v1
1 abc 12.5
2 defg 100.0
3 hijk 100.0
4 abfg 2.0
5 water 785.5
I would create the object as a data frame, i.e.:
volume_water = data.frame(Substance="water", v1=785.5)
Then you can rbind it with data.
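A small sketch of that approach, rebuilding the question's data frame by hand (the values are copied from the question; combined is just an illustrative name):

```r
# The existing data frame from the question
data <- data.frame(Substance = c("abc", "defg", "hijk", "abfg"),
                   v1 = c(12.5, 100, 100, 2),
                   stringsAsFactors = FALSE)

# Build the new row as a one-row data frame with matching column names,
# so the name stays character and the volume stays numeric
volume_water <- data.frame(Substance = "water", v1 = 785.5,
                           stringsAsFactors = FALSE)

combined <- rbind(data, volume_water)
print(combined)
```

The NA in the original attempt comes from c("water", water) producing a character vector, after which as.numeric() cannot convert "water". Keeping the two values in separate columns avoids the coercion altogether.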

R: Creating an index vector

I need some help with R coding here.
The data set Glass consists of 214 rows of data in which each row corresponds to a glass sample. Each row consists of 10 columns. When viewed as a classification problem, column 10 (Type) specifies the class of each observation/instance. The remaining columns are attributes that might be used to infer column 10. Here is an example of the first row:
RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
First, I cast column 10 so that it is interpreted by R as a factor instead of an integer value.
Now I need to create a vector with indices for all observations (it must have the values 1-214). This is needed to create training data for Naive Bayes. I know how to create a vector with 214 values, but not one that has specific indices for observations from a data frame.
Thanks.
I'm not totally sure that I get what you're trying to do... So please forgive me if my solution isn't helpful. If your df's name is 'df', just use the dplyr package for reordering your columns and write
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that just is 1,2,3,4,5,6,7,8,9,10... and I'll call it 'index' ...I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0
Hope this will help
df$ind <- seq.int(nrow(df))
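If the index vector is meant for splitting off training data, a rough sketch could look like this. The data frame glass_like is a made-up stand-in with 214 rows (not the real Glass data), and the 70/30 split is only an assumption for illustration.

```r
set.seed(1)  # make the random split reproducible

# Hypothetical stand-in for the Glass data: 214 rows, a factor class column
glass_like <- data.frame(RI = rnorm(214),
                         Type = factor(sample(1:7, 214, replace = TRUE)))

# Index vector covering every observation: 1, 2, ..., 214
idx <- seq_len(nrow(glass_like))

# Draw about 70% of the indices for training; the rest are for testing
train_idx <- sample(idx, size = floor(0.7 * length(idx)))
test_idx <- setdiff(idx, train_idx)

train <- glass_like[train_idx, ]
test <- glass_like[test_idx, ]
```

The index vector does not have to be stored as a column; it only needs to be used to subset the rows.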

Subsetting closest values in a column based on a binary column in a data frame in R

I have a data frame with 85 rows and 35 columns which is sorted based on age column, like below:
No Gender Age
1 F 5.8
2 F 5.9
3 F 6
4 M 6.2
5 F 7
6 F 7.2
7 M 7.4
8 M 7.8
9 M 7.9
10 M 8.1
11 F 8.3
12 F 8.6
13 M 8.9
14 M 9
15 F 9.2
16 F 9.3
I need to subset the closest ages across different genders, like below:
No Gender Age
1 F 6
2 M 6.2
3 F 7.2
4 M 7.4
5 M 8.1
6 F 8.3
7 F 8.6
8 M 8.9
9 M 9
10 F 9.2
Ok, I think I got this. It was surprisingly difficult, and maybe someone else will be able to come up with a more elegant solution, but here's what I got:
df <- data.frame(No=c(1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L),
                 Gender=c('F','F','F','M','F','F','M','M','M','M','F','F','M','M','F','F'),
                 Age=c(5.8,5.9,6,6.2,7,7.2,7.4,7.8,7.9,8.1,8.3,8.6,8.9,9,9.2,9.3),
                 stringsAsFactors=FALSE)
mls <- df$Gender=='M'
mages <- df$Age[mls]
fages <- df$Age[!mls]
fisLower <- findInterval(mages,fages)
TOL <- 1e-5
fisClosest <- fisLower+ifelse(
  fisLower==0L | fisLower<length(fages) &
    mages-fages[replace(fisLower,fisLower==0L,NA)]>fages[fisLower+1L]-mages+TOL,
  1L,0L)
mis <- unname(tapply(seq_along(mages),fisClosest,
  function(is) is[which.min(abs(mages[is]-fages[fisClosest[is[1L]]]))]))
fis <- unique(fisClosest)
df[sort(c(which(mls)[mis],which(!mls)[fis])),]
## No Gender Age
## 3 3 F 6.0
## 4 4 M 6.2
## 6 6 F 7.2
## 7 7 M 7.4
## 10 10 M 8.1
## 11 11 F 8.3
## 12 12 F 8.6
## 13 13 M 8.9
## 14 14 M 9.0
## 15 15 F 9.2
Explanation of variables:
df: The input data.frame.
mls ("male logicals"): A logical vector representing which elements of df$Gender are male.
mages ("male ages"): The subset of df$Age for male rows.
fages ("female ages"): The subset of df$Age for female rows.
fisLower ("female indexes lower"): For each element of mages, this has the index into fages of the female age that lies just below (or possibly equal to) the male age. This could be zero if fages has no ages below the element of mages. Hence this vector is "parallel" to mages, meaning it's the same length and the elements correspond to each other.
TOL ("tolerance"): This was a necessary annoyance to prevent spurious floating-point comparison errors in the following statement.
fisClosest ("female indexes closest"): This is a simple transformation of fisLower. Basically, we must add 1L to each element of fisLower if the corresponding element of mages is actually closer to the subsequent element of fages (the "upper" one) rather than the one pointed to by the corresponding element of fisLower (the "lower" one). This must be done for two cases: (1) zero elements of fisLower, and (2) where the element of fisLower points to a non-last element of fages and the element of mages is actually closer to the subsequent element of fages.
mis ("male indexes"): First of all, understand that fisClosest may contain duplicates if multiple male ages have the same female age as their closest, in other words there is no other female age closer to that male age, for all of them. For each of these conflicts, we must find the one male age that is closest to the female age from the set of male ages. This requires a vector aggregation for which tapply() is appropriate. We group by fisClosest, passing mages indexes into the lambda, where we call which.min() on the absolute differences between the ages to get the winning male age, and return its index.
fis ("female indexes"): This is simply the unique set of indexes into fages which we need to select from df; we get this from fisClosest by removing duplicates.
At this point we can finally convert from mages and fages indexes (mis and fis) to df row indexes by indexing the appropriate respective polarities of mls. After combining and sorting the two index sets, we can finally index df to get the required output.
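To see what findInterval() contributes here, this short sketch runs it on the male and female ages from the question (typed in by hand; lower is an illustrative name standing in for fisLower):

```r
# Ages from the question, split by gender; both vectors are sorted
mages <- c(6.2, 7.4, 7.8, 7.9, 8.1, 8.9, 9)
fages <- c(5.8, 5.9, 6, 7, 7.2, 8.3, 8.6, 9.2, 9.3)

# For each male age: index of the largest female age that is <= it
lower <- findInterval(mages, fages)
print(lower)         # 3 5 5 5 5 7 7
print(fages[lower])  # 6.0 7.2 7.2 7.2 7.2 8.6 8.6

# A value below every female age gets index 0, which is why the
# zero case needs special handling
print(findInterval(5, fages))  # 0
```

Four male ages share index 5 (female age 7.2), which is exactly the kind of conflict that the tapply() step resolves.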
Original (Incorrect) Solution
It looks like you want the first and last row of each run length, excepting the first and last row of the entire data.frame. Here's one way to achieve that:
df <- data.frame(No=c(1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L,14L,15L,16L),
                 Gender=c('F','F','F','M','F','F','M','M','M','M','F','F','M','M','F','F'),
                 Age=c(5.8,5.9,6,6.2,7,7.2,7.4,7.8,7.9,8.1,8.3,8.6,8.9,9,9.2,9.3),
                 stringsAsFactors=FALSE)
x <- cumsum(rle(df$Gender)$lengths)
df2 <- df[unique(c(rbind(c(1L,x[-length(x)]+1L),x))),]
df2 <- df2[-c(1L,nrow(df2)),] ## remove first and last row from original data.frame
df2
## No Gender Age
## 3 3 F 6.0
## 4 4 M 6.2
## 5 5 F 7.0
## 6 6 F 7.2
## 7 7 M 7.4
## 10 10 M 8.1
## 11 11 F 8.3
## 12 12 F 8.6
## 13 13 M 8.9
## 14 14 M 9.0
## 15 15 F 9.2
I think you missed the F 7.0 row in your expected output; other than that, this gets the same set of rows. If you want to fix up No to be sequential from 1, you can run df2$No <- seq_len(nrow(df2)). Ditto for the row names (with rownames(df2) on the LHS).
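For readers unfamiliar with rle(), this sketch shows how the run boundaries used in that solution come out for the question's Gender column (gender, run_ends and run_starts are illustrative names):

```r
gender <- c('F','F','F','M','F','F','M','M','M','M','F','F','M','M','F','F')

# rle() compresses the vector into consecutive runs of equal values
runs <- rle(gender)
print(runs$values)   # "F" "M" "F" "M" "F" "M" "F"
print(runs$lengths)  # 3 1 2 4 2 2 2

# Last row of each run, and first row of each run
run_ends <- cumsum(runs$lengths)
run_starts <- c(1L, run_ends[-length(run_ends)] + 1L)
print(run_starts)    # 1 4 5 7 11 13 15
print(run_ends)      # 3 4 6 10 12 14 16
```

Interleaving the starts and ends with rbind() and dropping duplicates gives the row selection used above.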

R: generate possible permutation tables by one column

I have a table that looks like this:
Indikaatori nimi Alamkriteerium Kriteerium Skoor
1 Indikaator 1 1.1 1 100
2 Indikaator 2 1.2 1 100
3 Indikaator 3 1.3 1 100
4 Indikaator 4 1.1 1 0
5 Indikaator 5 2.1 2 0
6 Indikaator 6 2.1 2 0
... and so on...
I need to create all possible permutations of the table by the first column.
There's a total of 50 indicators, from which I want to pick 49 and get all the possible combinations, along with the other data columns of the chosen elements.
With 49 elements out of 50, I will get a total of 50 permutations, but I want to create all these tables automatically rather than manually (later on, 48 elements will also be necessary).
Is there any way to generate these 50 tables automatically, with the data that belongs to the chosen elements?
All help and pointers are appreciated!
# The following will give you a list of fifty data frames,
# Each data frame has a 49 row subset of the original
listoftables <- apply(combn(1:50, 49), 2, FUN = function(x) df[x,])
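A variation on the same idea, sketched on a made-up 5-row table so the result is easy to inspect: combn() accepts a FUN argument, and with simplify = FALSE it returns the subsets directly as a list. The data frame small and its score column are invented for this illustration.

```r
# Hypothetical 5-row stand-in for the 50-indicator table
small <- data.frame(name = paste("Indikaator", 1:5),
                    score = c(100, 100, 100, 0, 0),
                    stringsAsFactors = FALSE)

# All tables that keep 4 of the 5 rows (the analogue of 49 out of 50)
leave_one_out <- combn(nrow(small), 4,
                       FUN = function(rows) small[rows, , drop = FALSE],
                       simplify = FALSE)

# All tables that keep 3 of the 5 rows (the analogue of 48 out of 50)
leave_two_out <- combn(nrow(small), 3,
                       FUN = function(rows) small[rows, , drop = FALSE],
                       simplify = FALSE)

print(length(leave_one_out))  # 5
print(length(leave_two_out))  # 10
```

For the real table, the same call with 50 rows and 48 kept would produce choose(50, 48) = 1225 data frames.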
This solution uses loops, which are rather slow compared to vectorized operations in R, but it will get you what you need in the form of a list of data.frames.
datatable = read.table(textConnection(
"2 Indikaator 2 1.2 1 100
3 Indikaator 3 1.3 1 100
4 Indikaator 4 1.1 1 0
5 Indikaator 5 2.1 2 0
6 Indikaator 6 2.1 2 0"))
# one slot per non-empty subset of rows, i.e. 2^n - 1 of them
x = rep(list(data.frame(NULL)),times = 2^nrow(datatable) - 1)
a = 1
for (i in 1:nrow(datatable)){
  # each column of sets is one combination of i row numbers
  sets = combn(nrow(datatable),i)
  for (j in 1:ncol(sets)){
    x[[a]] = datatable[sets[,j],]
    a = a+1
  }
}
View(x[[10]])

R Programming Calculate Rows Average

How can I use R to calculate row means?
Sample data:
f<- data.frame(
name=c("apple","orange","banana"),
day1sales=c(2,5,4),
day1sales=c(2,8,6),
day1sales=c(2,15,24),
day1sales=c(22,51,13),
day1sales=c(5,8,7)
)
Expected Results :
More columns will be added to the table later. For example, the expected results here only go up to day1sales.4; after more data is processed, columns up to day1sales.6 and beyond will be added. How can I calculate the average for every row?
With rowMeans:
> rowMeans(f[-1])
## [1] 6.6 17.4 10.8
You can also add a column of means to the data set:
> f$AvgSales <- rowMeans(f[-1])
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AvgSales
## 1 apple 2 2 2 22 5 6.6
## 2 orange 5 8 15 51 8 17.4
## 3 banana 4 6 24 13 7 10.8
rowMeans is the simplest way. The apply function will also apply a function along the rows or columns of a data frame. In this case you want to apply the mean function to the rows:
f$AverageSales <- apply(f[, 2:length(f)], 1, mean)
(I changed 6 to length(f) since you say you may add more columns.)
This will add an AverageSales column to the data frame f with the values that you want:
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AverageSales
## 1 apple 2 2 2 22 5 6.6
## 2 orange 5 8 15 51 8 17.4
## 3 banana 4 6 24 13 7 10.8
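Since the number of sales columns is expected to grow, one option is to select them by name instead of by position. This is only a sketch: it assumes every sales column name starts with "day1sales", and the extra column day1sales.5 is made up to simulate new data arriving.

```r
# Same data as in the question, with the column names R would generate
f <- data.frame(name = c("apple", "orange", "banana"),
                day1sales = c(2, 5, 4),
                day1sales.1 = c(2, 8, 6),
                day1sales.2 = c(2, 15, 24),
                day1sales.3 = c(22, 51, 13),
                day1sales.4 = c(5, 8, 7))

# Pick the sales columns by name, so the name column and any summary
# columns added earlier are left out of the mean
sales_cols <- grep("^day1sales", names(f))
f$AverageSales <- rowMeans(f[sales_cols])
print(f$AverageSales)  # 6.6 17.4 10.8

# Later, a new (hypothetical) sales column arrives; recompute the same way
f$day1sales.5 <- c(1, 1, 1)
sales_cols <- grep("^day1sales", names(f))
f$AverageSales <- rowMeans(f[sales_cols])
```

Selecting by pattern means the AverageSales column itself is never averaged, even though it now sits between the sales columns.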
