Merge values of a factor column - r

Column data$form contains 170 unique values (numbers from 1 to ~800).
I would like to merge some of these values (e.g. with a radius/step of 10).
I need to do this in order to use:
colors = rainbow(length(unique(data$form)))
in a plot and get a better visual result.
Thank you in advance for your help.

You can use %/% to group them, mean to combine them, and a min-max rescaling to map the group means onto a convenient range.
# if you want specifically 20 groups:
groups <- sort(form) %/% (800/20)
x <- c(by(sort(form), groups, mean))
x <- (x - min(x)) / (max(x) - min(x)) * 19 + 1  # min-max rescale the group means to the range 1..20
0 1 2 3 4
1.000000 1.971781 2.957476 4.103704 4.948560
5 6 7 8 9
5.950617 7.175309 7.996914 8.953086 9.952263
10 11 12 13 14
10.800705 11.901235 12.888889 13.772291 14.888889
15 16 17 18 19
15.927984 16.864198 17.918519 18.860082 20.000000

You could also use cut. If you use the argument labels=FALSE, you get an integer value:
form <- runif(170, min=1,max=800)
> cut(form, breaks=20)
[1] (518,558] (280,320] (240,280] (121,160] (757,797]
[6] (160,200] (320,359] (598,638] (80.8,121] (359,399]
[11] (121,160] (200,240] ...
20 Levels: (1.18,41] (41,80.8] (80.8,121] (121,160] (160,200] (200,240] (240,280] (280,320] (320,359] (359,399] (399,439] ... (757,797]
> cut(form, breaks=20, labels=FALSE)
[1] 14 8 7 4 20 5 9 16 3 10 4 6 5 18 18 6 2 12
[19] 2 19 13 11 13 11 14 12 17 5 ...
On a side note, you may want to reconsider plotting with rainbow colours, as it distorts how the data are read; cf. Rainbow Color Map (Still) Considered Harmful.
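A minimal sketch of how the cut() bins could drive the plot colours (example data drawn with runif as above; hcl.colors() is used here instead of rainbow() as a perceptually safer palette):
set.seed(1)
form <- runif(170, min = 1, max = 800)
bins <- cut(form, breaks = 20, labels = FALSE)   # integer bin code per value
colors <- hcl.colors(20, palette = "viridis")    # one colour per bin
plot(seq_along(form), form, col = colors[bins], pch = 19)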

Related

Find the mean of several indexed lists at index values in R

I have 4 vectors of predicted y values, each presented as a named (indexed) vector in R:
> y_a
2 12 15 19 20 22 3 4
26.05434 24.33894 38.57935 37.94003 23.87608 46.20327 18.43043 24.96521
5 8 13 21 1 7 10 11
17.34129 30.41087 28.49836 39.02917 21.96358 30.41087 23.61032 30.41087
16 18
35.31196 35.85652
> y_b
6 9 14 17 23 24 3 4
36.87726 35.30301 40.48044 38.24398 42.67726 41.31053 32.32106 33.81204
5 8 13 21 1 7 10 11
32.07257 35.05451 40.31655 44.74850 38.82558 35.05451 27.80451 35.05451
16 18
36.17274 36.29699
> y_c
6 9 14 17 23 24 2 12
30.24043 35.33617 39.18723 33.63404 42.76170 39.36809 32.25106 24.04894
15 19 20 22 1 7 10 11
39.34681 38.28298 31.01702 43.66596 33.19787 34.71915 27.60213 34.71915
16 18
37.49574 37.80426
> y_d
6 9 14 17 23 24 2 12
26.48159 35.12368 38.41591 31.00840 40.54660 36.01979 31.00840 22.70478
15 19 20 22 3 4 5 8
40.47355 32.72757 29.36229 46.23494 25.24701 30.18534 24.42395 34.30063
13 21
32.72757 33.55063
I would like to create a vector that gives the average of the points across the lists at the same index; in other words, the average of the values at index 2, index 12, index 15, and so on.
> y_mean
2 6 9 12....
26.05434 31.8664 ...... ......
Any ideas on how to do that?
We can gather the vectors into a list, stack each into a two-column data.frame, rbind them, and do a group-by mean:
dat <- do.call(rbind,
               lapply(mget(ls(pattern = "^y_[a-z]$")), stack))
aggregate(values ~ ind, dat, FUN = mean)
Or use tapply
with(dat, tapply(values, ind, FUN = mean))
Or if there are only four vectors, just do
v1 <- c(y_a, y_b, y_c, y_d)
tapply(v1, names(v1), FUN = mean)
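A minimal sketch of that last approach on made-up vectors (the names play the role of the indices above), showing how tapply averages values that share a name:
y_a <- c(`2` = 26.05, `3` = 18.43)
y_b <- c(`3` = 32.32, `6` = 36.88)
v1 <- c(y_a, y_b)                   # one long named vector
tapply(v1, names(v1), FUN = mean)   # index "3" averages 18.43 and 32.32, giving 25.375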

How can this R code be sped up with the apply (lapply, mapply, etc.) functions?

I am not too proficient with the apply functions, or with R in general, but I know I overuse for loops, which makes my code slow. How can the following code be sped up with apply functions, or in any other way?
sum_store = NULL
for (col in 1:ncol(cazy_fams)){   # for each column in cazy_fams (so for each master family, e.g. GH, AA, etc.)
  for (row in 1:nrow(cazy_fams)){ # for each row in cazy_fams (so the specific family number, e.g. GH1, AA7, etc.)
    # Isolate the row that pertains to the current cazy family for every dataframe in the list
    filt_fam = lapply(family_summary, function(sample){
      sample[as.character(sample$Family) %in% paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = ""), ]
    })
    row_cat = do.call(rbind, filt_fam) # concatenate the lapply list output into a dataframe
    if (nrow(row_cat) > 0){
      fam_sum = aggregate(proteins ~ Family, data = row_cat, FUN = sum) # collapse the dataframe into one row, summing the protein counts
      sum_store = rbind(sum_store, fam_sum) # store the results for that family
    } else if (grepl("NA", paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = "")) == FALSE) {
      Family = paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = "")
      proteins = 0
      sum_store = rbind(sum_store, data.frame(Family, proteins))
    } else {
      next
    }
  }
}
family_summary is just a list of 18 two-column data frames that look like this:
Family proteins
CE0 2
CE1 9
CE4 15
CE7 1
CE9 1
CE14 10
GH0 5
GH1 1
GH3 4
GH4 1
GH8 1
GH9 2
GH13 2
GH15 5
GH17 1
with different cazy families.
cazy_fams is just a data frame with each column being a cazy class (e.g. GH, AA, etc.) and each row being a family number, all taken from the linked website:
GH GT PL CE AA CBM
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
10 10 10 10 10 10
11 11 11 11 11 11
12 12 12 12 12 12
13 13 13 13 13 13
14 14 14 14 14 14
15 15 15 15 15 15
The reason behind the else if (grepl("NA", paste(colnames(cazy_fams[col]), cazy_fams[row, col], sep = "")) == FALSE) statement is that not all classes have the same number of families, so when looping over my data frame I end up with some GHNA and AANA entries with NA on the end.
The output sum_store is this:
Family proteins
GH1 54
GH2 51
GH3 125
GH4 29
GH5 40
GH6 25
GH7 0
GH8 16
GH9 25
GH10 19
GH11 5
GH12 5
GH13 164
GH14 3
GH15 61
A data frame with all listed cazy families and the total number of appearances across the family_summary list.
Please let me know if you need anything else to help answer my question.
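For reference, a minimal sketch of how the nested loops could in principle be collapsed into a single aggregation, assuming family_summary and cazy_fams look like the samples above (this is an illustrative alternative, not the asker's code):
# all family names implied by cazy_fams, e.g. "GH1", "AA7", skipping NA cells
fam_names <- unlist(lapply(names(cazy_fams), function(cls)
  paste0(cls, na.omit(cazy_fams[[cls]]))))
combined <- do.call(rbind, family_summary)                # one big data frame
counts   <- aggregate(proteins ~ Family, combined, sum)   # total proteins per family
sum_store <- data.frame(Family   = fam_names,
                        proteins = counts$proteins[match(fam_names, counts$Family)])
sum_store$proteins[is.na(sum_store$proteins)] <- 0        # families never observed get 0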

How to use an apply function instead of a for loop if you have multiple if conditions to be executed

1st DF:
t.d
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
names(t.d) <- c("ID","A","B","C")
t.d$FinalTime <- c("7/30/2009 08:18:35","9/30/2009 19:18:35","11/30/2009 21:18:35","13/30/2009 20:18:35","15/30/2009 04:18:35")
t.d$InitTime <- c("6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35")
>t.d
ID A B C FinalTime InitTime
1 1 6 11 16 7/30/2009 08:18:35 6/30/2009 9:18:35
2 2 7 12 17 9/30/2009 19:18:35 6/30/2009 9:18:35
3 3 8 13 18 11/30/2009 21:18:35 6/30/2009 9:18:35
4 4 9 14 19 13/30/2009 20:18:35 6/30/2009 9:18:35
5 5 10 15 20 15/30/2009 04:18:35 6/30/2009 9:18:35
2nd DF:
> s.d
F D E Time
1 10 19 28 6/30/2009 08:18:35
2 11 20 29 8/30/2009 19:18:35
3 12 21 30 9/30/2009 21:18:35
4 13 22 31 01/30/2009 20:18:35
5 14 23 32 10/30/2009 04:18:35
6 15 24 33 11/30/2009 04:18:35
7 16 25 34 12/30/2009 04:18:35
8 17 26 35 13/30/2009 04:18:35
9 18 27 36 15/30/2009 04:18:35
The output should be:
From DF "t.d" I have to calculate, for each row, the time interval between "FinalTime" and "InitTime" (InitTime will always be earlier than FinalTime).
Another DF "temp" has to be formed from "s.d", containing only the data within the above time interval; then the most recent values of "F", "D", and "E" have to be taken and attached to the ith row of "t.d" from which the time interval was calculated.
We also have to check whether the newly formed DF "temp" satisfies the following condition, where 'j' represents each row:
if((temp$F[j] < 35.5) + (temp$D[j] >= 100) >= 1){
  temp$Flag <- 1
} else {
  temp$Flag <- 0
}
Originally I have 3 million rows in the dataframe and 20 columns in each DF.
I have solved the above problem using a for loop, but it obviously takes 2 to 3 days as there are a lot of rows.
(Also, what if I have to add new columns to the resultant DF when multiple conditions are satisfied on each row?)
Can anybody suggest a different technique? Like using apply functions?
My suggestion is:
use lapply over the row indices,
handle your if branches inside the function,
return either your data frame or NULL,
and combine everything with rbind.
By replacing lapply with mclapply from the 'parallel' package, your code gets executed in parallel.
resultList <- lapply(1:nrow(t.d), function(i){
  # do stuff
  if(condition){
    return(df)
  } else {
    return(NULL)
  }
})
resultDF <- do.call(rbind, resultList)
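A slightly fuller sketch of that pattern, assuming the timestamp columns parse as month/day/year strings and that "most recent" means the latest s.d$Time inside the interval (column names and details are illustrative, not from the answer):
library(parallel)
t.d$FinalTime <- as.POSIXct(t.d$FinalTime, format = "%m/%d/%Y %H:%M:%S")
t.d$InitTime  <- as.POSIXct(t.d$InitTime,  format = "%m/%d/%Y %H:%M:%S")
s.d$Time      <- as.POSIXct(s.d$Time,      format = "%m/%d/%Y %H:%M:%S")
resultList <- mclapply(seq_len(nrow(t.d)), function(i){
  # rows of s.d that fall inside this row's time interval
  temp <- s.d[s.d$Time >= t.d$InitTime[i] & s.d$Time <= t.d$FinalTime[i], ]
  if (nrow(temp) == 0) return(NULL)
  latest <- temp[which.max(temp$Time), ]   # most recent F, D, E values
  flag   <- as.integer((latest$F < 35.5) + (latest$D >= 100) >= 1)
  cbind(t.d[i, ], latest[, c("F", "D", "E")], Flag = flag)
})
resultDF <- do.call(rbind, resultList)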

ggplot2 is plotting a line strangely

I am trying to plot the time series x_t = A + (-1)^t B.
To do this I am using the following code. The problem is that the ggplot is wrong.
require(ggplot2)
set.seed(42)
N<-2
A<-sample(1:20,N)
B<-rnorm(N)
X<-c(A+B,A-B)
dat<-sapply(1:N,function(n) X[rep(c(n,N+n),20)],simplify=FALSE)
dat<-data.frame(t=rep(1:20,N),w=rep(A,each=20),val=do.call(c,dat))
ggplot(data=dat,aes(x=t, y=val, color=factor(w)))+
geom_line()+facet_grid(w~.,scale = "free")
Looking at the head of dat, everything looks right:
> head(dat)
t w val
1 1 12 10.5533
2 2 12 13.4467
3 3 12 10.5533
4 4 12 13.4467
5 5 12 10.5533
6 6 12 13.4467
So the lower (blue) line should only take the values 10.5533 and 13.4467, but it also takes other values. What is wrong in my code?
Thanks in advance for any help
You really should be more careful before asserting that something is "wrong". The way you are creating dat, the rows are not ordered by dat$t, so head(...) does not display the extra values:
head(dat[order(dat$w,dat$t),],10)
# t w val
# 21 1 18 18.43530
# 61 1 18 18.36313
# 22 2 18 19.56470
# 62 2 18 17.63687
# 23 3 18 18.43530
# 63 3 18 18.36313
# 24 4 18 19.56470
# 64 4 18 17.63687
# 25 5 18 18.43530
# 65 5 18 18.36313
Note the row numbers.
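For completeness, a minimal sketch of one way dat could be rebuilt so that t, w, and val all have the same length, under my reading that the intended series is A - B at odd t and A + B at even t (this fix is not part of the answer above):
dat <- do.call(rbind, lapply(1:N, function(n){
  data.frame(t   = 1:20,
             w   = A[n],
             val = ifelse(1:20 %% 2 == 1, A[n] - B[n], A[n] + B[n]))
}))
ggplot(dat, aes(x = t, y = val, color = factor(w))) +
  geom_line() + facet_grid(w ~ ., scales = "free")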

under what circumstances does R recycle?

I have two variables, x (which takes 5 values) and y (which takes 11 values). When I try to run the expression
> v <- 2*x +y +1
R responds:
Error at 2* x+y: Longer object length is not a multiple of shorter object length.
I tried: 1*x gives me the 5 values of x, but y has 11 values. So R says it can't add 11 values to 5 values? This raises the general question: under what circumstances does recycling work?
Recycling works in your example:
> x <- seq(5)
> y <- seq(11)
> x+y
[1] 2 4 6 8 10 7 9 11 13 15 12
Warning message:
In x + y : longer object length is not a multiple of shorter object length
> v <- 2*x +y +1
Warning message:
In 2 * x + y :
longer object length is not a multiple of shorter object length
> v
[1] 4 7 10 13 16 9 12 15 18 21 14
The "error" that you reported is in fact a "warning" which means that R is notifying you that it is recycling but recycles anyway. You may have options(warn=2) turned on, which converts warnings into error messages.
In general, avoid relying on recycling. If you get in the habit of ignoring the warnings, some day it will bite you and your code will fail in some very hard to diagnose way.
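For illustration, recycling is silent when the longer length is an exact multiple of the shorter one; the shorter vector is simply reused and no warning is emitted:
> x <- 1:6
> y <- 1:2
> x + y
[1] 2 4 4 6 6 8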
It doesn't work this way. You have to have vectors of the same length:
x_samelen = c(1,2,3)
y_samelen = c(10,20,30)
x_samelen*y_samelen
[1] 10 40 90
If vectors are of the same length, the result is well defined and understood. You can do "recycling", but it really is not advisable to do so.
I wrote a short script to make your two vectors the same length by padding the shorter vector. This will let you execute your code without warnings:
x_orig <- c(1,2,3,4,5,6,7,8,9,10,11)
y_orig <- c(21,22,23,24,25)
if ( length(x_orig) > length(y_orig) ) {
  x <- x_orig
  y <- head(x = as.vector(t(rep(x = y_orig, times = ceiling(length(x_orig)/length(y_orig))))), n = length(x_orig))
  cat("padding y\r\n")
} else {
  x <- head(x = as.vector(t(rep(x = x_orig, times = ceiling(length(y_orig)/length(x_orig))))), n = length(y_orig))
  y <- y_orig
  cat("padding x\r\n")
}
The results are:
x_orig
[1] 1 2 3 4 5 6 7 8 9 10 11
y_orig
[1] 21 22 23 24 25
x
[1] 1 2 3 4 5 6 7 8 9 10 11
y
[1] 21 22 23 24 25 21 22 23 24 25 21
If you reverse x_orig and y_orig:
x_orig
[1] 21 22 23 24 25
y_orig
[1] 1 2 3 4 5 6 7 8 9 10 11
x
[1] 21 22 23 24 25 21 22 23 24 25 21
y
[1] 1 2 3 4 5 6 7 8 9 10 11
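As a side note (my suggestion, not part of the original script), base R's rep() with the length.out argument does the same padding in a single call:
> y <- rep(y_orig, length.out = length(x_orig))
> y
[1] 21 22 23 24 25 21 22 23 24 25 21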
