aggregate using "factors" that are NA - r

I'm struggling to aggregate a data frame into the format I want. The data frame contains a series of parts, along with a list of tests that are performed (Length and Width), and a lower and upper limit (LL and UL) for each measurement. Some of the tests don't have one or the other limit. I'm trying to get a count of how many parts have a given "test-LL-UL" combination, including those tests with NA as one of the limits.
What I've tried so far is the following:
df<-read.table(header = TRUE, text = "
Part Test LL UL
A L 20 40
A W 5 7
B L 20 NA
B W 5 7
C L 20 40
C W 10 30
")
aggregate(data=df,Part~Test+LL+UL,FUN=length,na.action=na.pass)
This gives the following output:
Test LL UL Part
1 W 5 7 2
2 W 10 30 1
3 L 20 40 2
What I was expecting to get was:
Test LL UL Part
1 W 5 7 2
2 W 10 30 1
3 L 20 40 2
4 L 20 NA 1
Any help would be greatly appreciated!

dplyr handles this quite nicely:
library(dplyr)
df %>% group_by(Test,LL,UL) %>% summarise( n() )

The dplyr package can do this with group_by() and summarize():
df <- data.frame(Part = c("A","A","B","B","C","C"),
Test = c("L","W","L","W","L","W"),
LL = c(20,5,20,5,20,10),
UL = c(40,7,NA,7,40,30))
library(dplyr)
grouped <- group_by(df, Test, LL, UL)
summarize(grouped, count = n())
## Test LL UL count
## (fctr) (dbl) (dbl) (int)
##1 L 20 40 2
##2 L 20 NA 1
##3 W 5 7 2
##4 W 10 30 1

In line with Jimbou's suggestion, the following works (but feels a little messy):
df<-read.table(header = TRUE, text = "
Part Test LL UL
A L 20 40
A W 5 7
B L 20 NA
B W 5 7
C L 20 40
C W 10 30
")
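# Replace missing limits with the string "NA" so they form their own group
df[is.na(df)] <- "NA"
df<-aggregate(data=df,Part~Test+LL+UL,FUN=length,na.action=na.pass)
# Convert back to numeric; "NA" becomes a real NA (with a coercion warning)
df$UL<-as.numeric(df$UL)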
I think the appropriate thing to do is to set the Upper Limits to Inf and the Lower Limits to -Inf (this more accurately reflects the meaning of the limits). In this case, the aggregate works as I'd expect.
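A minimal sketch of that Inf/-Inf idea, using the sample data from the question. aggregate() omits rows whose grouping variables contain NA, which is why the "L 20 NA" row disappeared; replacing a missing limit with an infinite value keeps the row in the grouping. The name counts is only illustrative.

```r
df <- read.table(header = TRUE, text = "
Part Test LL UL
A L 20 40
A W 5 7
B L 20 NA
B W 5 7
C L 20 40
C W 10 30
")

# Treat a missing limit as "unbounded" rather than unknown.
df$LL[is.na(df$LL)] <- -Inf
df$UL[is.na(df$UL)] <- Inf

# No NA is left in the grouping columns, so no group is dropped.
counts <- aggregate(Part ~ Test + LL + UL, data = df, FUN = length)
counts
```

The result has four rows, including one for Test L with LL 20 and UL Inf.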

Related

Will head() and tail() functions in R change the order of output?

I know head() and tail() return the first or last parts of a dataset, but I want to know whether the two functions reorder the output or just return the rows in their existing order. Many thanks in advance!
As you can see below, they do keep the original order:
df <- data.frame(number = 1:26, letter = letters[1:26])
> head(df)
number letter
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
> tail(df)
number letter
21 21 u
22 22 v
23 23 w
24 24 x
25 25 y
26 26 z
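The example above uses data that is already sorted, so here is a small sketch with deliberately unsorted rows to make the point clearer. The data values are made up for illustration.

```r
# Rows are intentionally not in sorted order.
df <- data.frame(number = c(3, 1, 2, 5, 4),
                 letter = c("c", "a", "b", "e", "d"))

first_two <- head(df, 2)  # first two rows, in their stored order
last_two  <- tail(df, 2)  # last two rows, in their stored order

first_two$number  # 3 1
last_two$number   # 5 4
```

Neither function sorts anything; if you want ordered output you would call order() or sort() yourself first.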

How to extract diagonal elements from dataframe and store in a variable?

I have a simple 9 element dataframe.
A B C
1 8 21 1
2 40 25 32
3 10 15 49
I want to extract the diagonal elements and store them in a variable. Is there an easier way to do this than taking one number out at a time?
In this case as they are all numeric you can use:
df <- data.frame(a=c(4,8,10), b = c(25,24,15), c = c(1,32,49))
df
a b c
1 4 25 1
2 8 24 32
3 10 15 49
Where this takes the diagonal.
diag(as.matrix(df))
[1] 4 24 49
You can use the diag function which extracts the diagonal of a matrix:
Data <- data.frame(a = c(1,2,3), b= c(11,12,13), c = c(111,112,113))
Data2 <- as.matrix(Data)
Result <- diag(Data2)
Result #Returns 1 12 113
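Both answers use their own sample values, so here is the same approach sketched on the data frame from the question itself. The variable name diagonal is just a placeholder.

```r
# The data frame shown in the question.
df <- data.frame(A = c(8, 40, 10),
                 B = c(21, 25, 15),
                 C = c(1, 32, 49))

# as.matrix() is safe here because every column is numeric.
diagonal <- diag(as.matrix(df))
diagonal  # 8 25 49
```

If the columns had mixed types, as.matrix() would coerce everything to a common type (typically character), so this shortcut is best kept for all-numeric data.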

Calculating molecular formulas out of mass of certain elements

For a chemistry project at school I want to calculate molecular masses of all possible combinations of molecular formulas including carbon (1 atom up to 100), oxygen (1 up to 50), hydrogen (1 up to 200), nitrogen (1 up to 20) and sulfur (1 up to 10) and save the results in one vector and the corresponding molecular formula string in another vector. The masses are numeric values: 12, 16, 1, 14 and 32. The strings are "C", "O", "H", "N", "S".
I want to delete molecular formulas that make no sense like C1 O100 H0 N20 S10 from the string and the corresponding mass, too. So to be more specific only leave the ones with a O/C relation between 0 and 1, a H/C relation between 2 and 1, a N/C relation between 0 and 0.2 and a S/C relation between 0 and 0.1.
Is there an easy way to do this? Is a for loop the only way, or is there a faster way (maybe arrays?), and how can I take the ratios between elements into account?
I would be very happy for some ideas or basic code to solve this.
@Gregor: so it would probably be better to exclude the atom ratios that don't make sense before the whole list is created? @Barker: yes, atoms like nitrogen should go from 0 to the maximum. I am very new to R, so when I try a loop I end up with only the last value calculated (shown here with reduced dimensions):
z=matrix(0,1,5*20*10*2*2)
C=12
O=16
H=1
N=14
S=32
for( u in 1:length(z)) {
for(i in 1:5) {
for (j in 1:20) {
for(k in 1:10 ) {
for(l in 0:1) {
for(m in 0:1){
z[1,u] <- C*i+H*j+O*k+N*l+S*m
}
}
}
}
}
}
does anyone know where the mistake is here?
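On the loop itself: the outer loop over u runs all five inner loops for every position, and each pass overwrites z[1,u], so every slot ends up holding the value of the very last combination (286). One way to repair it, as a sketch that keeps the same reduced ranges, is to drop the outer loop and advance a counter inside the innermost loop instead.

```r
C <- 12; O <- 16; H <- 1; N <- 14; S <- 32

z <- numeric(5 * 20 * 10 * 2 * 2)  # one slot per combination
u <- 0                              # running position in z

for (i in 1:5) {
  for (j in 1:20) {
    for (k in 1:10) {
      for (l in 0:1) {
        for (m in 0:1) {
          u <- u + 1  # move to the next free slot
          z[u] <- C * i + H * j + O * k + N * l + S * m
        }
      }
    }
  }
}

z[1]          # 29  (C1 H1 O1)
z[length(z)]  # 286 (C5 H20 O10 N1 S1)
```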
expand.grid is a good place to start in generating combinations. For example, to create a data.frame with combinations of H and C you could do this
mol = expand.grid(C = 1:3, H = 1:4)
mol
# C H
# 1 1 1
# 2 2 1
# 3 3 1
# 4 1 2
# 5 2 2
# 6 3 2
# 7 1 3
# 8 2 3
# 9 3 3
# 10 1 4
# 11 2 4
# 12 3 4
You can add on the other elements in expand.grid as well and also adjust the inputs up to 1:200 or however many you want. Note that the full ranges in your question give 100 × 50 × 200 × 20 × 10 = 200 million combinations, which is a very large data frame and may not fit in memory. If you could reduce the total number of combinations to around 1 million it will be much easier on your memory.
The next step would be to delete rows that don't meet your ratio criteria. Here's one example, to make sure that the number of H is between 1 and 2 times the number of C:
mol = mol[mol$H >= mol$C & mol$H <= 2 * mol$C, ]
mol
# C H
# 1 1 1
# 4 1 2
# 5 2 2
# 8 2 3
# 9 3 3
# 11 2 4
# 12 3 4
Repeat steps like that for all your conditions.
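As a sketch of what repeating that filter for every ratio in the question might look like, here are all five elements with deliberately small ranges. The name mol_all and the reduced ranges are assumptions for illustration; the ratio limits are the ones stated in the question, written with integer arithmetic to avoid floating-point comparisons.

```r
# Small ranges so the grid stays tiny; scale up as memory allows.
mol_all <- expand.grid(C = 1:10, H = 1:20, O = 0:10, N = 0:2, S = 0:1)

keep <- with(mol_all,
  O <= C &              # O/C between 0 and 1
  H >= C & H <= 2 * C & # H/C between 1 and 2
  5 * N <= C &          # N/C between 0 and 0.2
  10 * S <= C           # S/C between 0 and 0.1
)

mol_all <- mol_all[keep, ]
nrow(mol_all)
```

Combining the conditions into one logical vector filters the grid in a single pass rather than subsetting five times.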
Finally you can calculate the weights and put it in a new column:
mol$weight = with(mol, C * 12 + H * 1)
mol
# C H weight
# 1 1 1 13
# 4 1 2 14
# 5 2 2 26
# 8 2 3 27
# 9 3 3 39
# 11 2 4 28
# 12 3 4 40
You could use matrix multiplication for the weight calculation, but there's no need with a small number of possible elements. If you had 20 or more possible input elements it would make sense to do it that way.
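For reference, a minimal sketch of the matrix multiplication variant mentioned above. The names masses, grid and weight are illustrative; the idea is that a named vector of atomic masses is multiplied against the matching count columns.

```r
# Named vector of masses; names must match the count columns.
masses <- c(C = 12, H = 1)

grid <- expand.grid(C = 1:3, H = 1:4)

# (rows x elements) %*% (elements) gives one weight per row.
weight <- as.vector(as.matrix(grid[, names(masses)]) %*% masses)
weight
```

Adding another element then only means adding one entry to masses and one column to the grid.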
Bonus! Formulas can be created with paste or paste0:
mol$formula = paste0("C", mol$C, " H", mol$H)
mol
# C H weight formula
# 1 1 1 13 C1 H1
# 4 1 2 14 C1 H2
# 5 2 2 26 C2 H2
# 8 2 3 27 C2 H3
# 9 3 3 39 C3 H3
# 11 2 4 28 C2 H4
# 12 3 4 40 C3 H4
Of course, most of these still won't make chemical sense - C1 H1 isn't something that would really exist, but maybe you can come up with even smarter conditions to get rid of more of the impossibilities!
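Once elements that may have a count of zero (such as N or S) are included, you probably don't want "N0" or "S0" in the formula string. Here is a small sketch of one way to skip them; make_formula is a hypothetical helper written for this example, not an existing function.

```r
# Hypothetical helper: build a formula string from named counts,
# leaving out any element whose count is zero.
make_formula <- function(counts) {
  parts <- ifelse(counts > 0, paste0(names(counts), counts), "")
  paste(parts[parts != ""], collapse = " ")
}

make_formula(c(C = 2, H = 4, O = 0, N = 1, S = 0))  # "C2 H4 N1"
```

To apply it to every row of a data frame of counts you could use something like apply(mol, 1, make_formula) on the count columns only.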

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I inserted the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table where you get a total for each run of df :
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
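One caveat with the look-up above: indexing rows with a factor uses the factor's integer codes, so it relies on the rows of total being in the same order as the levels of run. A sketch of a variant that looks values up by matching on run instead, with total deliberately given out of order (an assumption made here to show the difference):

```r
df <- data.frame(run   = factor(c(1, 1, 1, 2, 2, 2)),
                 group = factor(rep(c("a", "b", "c"), 2)),
                 sum   = c(1, 8, 34, 2, 7, 33))

# Rows intentionally reversed relative to the factor levels.
total <- data.frame(run = factor(c(2, 1)), total = c(47, 45))

# match() pairs each run in df with the row of total holding the same run.
df$percent <- df$sum / total$total[match(df$run, total$run)]
df
```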
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First you want to merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate:
library(dplyr)
df2 <- mutate(df2, percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Expand Records to create Edges for igraph

I have a dataset that has multiple data points I want to map. igraph uses 1-1 relationships though, so I'm looking for a way to turn one long record into many 1-1 records. For example:
test <- data.frame(
drug1=c("A","B","C","D","E","F","G","H","I","J","K"),
drug2=c("P","O","R","T","L","A","N","D","R","A","D"),
drug3=c("B","O","R","I","S","B","E","C","K","E","R"),
age=c(15,20,35,1,35,58,51,21,54,80,75))
Which gives this output
drug1 drug2 drug3 age
1 A P B 15
2 B O O 20
3 C R R 35
4 D T I 1
5 E L S 35
6 F A B 58
7 G N E 51
8 H D C 21
9 I R K 54
10 J A E 80
11 K D R 75
I'd like to make a new table with drug1-drug2 and then stack drug2-drug3 into the previous column. So it would look like this.
drug1 drug2 age
1 A P 15
2 P B 15
3 B O 20
4 O O 20
5 C R 35
drug2 is moved into the drug1 spot and drug3 is moved into the drug2 spot. I realize I can do this by creating multiple smaller steps, but I was wondering if anyone knew of a way to loop this process. I have up to 11 fields.
Here are the smaller steps.
a <- test[,c("drug1","drug2","age")]
b <- test[,c("drug2","drug3","age")]
names(b) <- c("drug1","drug2","age")
test2 <- rbind(a,b)
drug1 drug2 age
1 A P 15
2 B O 20
3 C R 35
4 D T 1
5 E L 35
6 F A 58
7 G N 51
8 H D 21
9 I R 54
10 J A 80
11 K D 75
12 P B 15
13 O O 20
14 R R 35
15 T I 1
16 L S 35
17 A B 58
18 N E 51
19 D C 21
20 R K 54
21 A E 80
22 D R 75
So if you have many fields, here's a helper function which can pull down the data into pairs.
pulldown <- function(data, cols=1:(min(attr)-1),
attr=ncol(data), newnames=names(data)[c(cols[1:2], attr)]) {
if(is.character(attr)) attr<-match(attr, names(data))
if(is.character(cols)) cols<-match(cols, names(data))
do.call(rbind, lapply(unname(data.frame(t(embed(cols,2)))), function(x) {
`colnames<-`(data[, c(sort(x), attr)], newnames)
}))
}
You can run it with your data with
pulldown(test)
It has a parameter called attr where you can specify the columns (index or names) you would like repeated every row (here I have it default to the last column). Then the cols parameter is a vector of all the columns that you would like to turn into pairs. (The default is the beginning to one before the first attr). You can also specify a vector of newnames for the columns as they come out.
With three columns your method is pretty simple, this might be a better choice for 11 columns.
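If the helper above is more than you need, a plainer sketch of the same idea is to loop over adjacent column pairs with lapply and stack the pieces. The names drug_cols, pairs and edges are illustrative, and the data is shortened to two rows to keep the example small.

```r
test <- data.frame(drug1 = c("A", "B"),
                   drug2 = c("P", "O"),
                   drug3 = c("B", "O"),
                   age   = c(15, 20))

# Columns to chain into pairs, in order; extend this for up to 11 fields.
drug_cols <- c("drug1", "drug2", "drug3")

# One data frame per adjacent pair (1-2, 2-3, ...), all with the same names.
pairs <- lapply(seq_len(length(drug_cols) - 1), function(i) {
  setNames(test[c(drug_cols[i], drug_cols[i + 1], "age")],
           c("drug1", "drug2", "age"))
})

edges <- do.call(rbind, pairs)
edges
```

The result has one row per record and pair, with age repeated, which is the shape an igraph edge list expects in its first two columns.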
Slightly more compact and a one-liner would be:
test2 <- rbind( test[c("drug1","drug2","age")],
setNames(test[c("drug2", "drug3", "age")], c("drug1", "drug2", "age"))
)
The setNames function can be useful when column names are missing or need to be coerced to something else.
