I have a question regarding the max function in R. I have a column in a data frame with two decimal places, and whenever I apply the max function to that column I get the highest value, but with only one decimal place.
max(df$e)
How can I get two decimal places? I tried using options() and round(), but nothing works.
Reproducible example:
a = c(228, 239)
b = c(50,83)
d = c(0.27,0.24)
e = c(2.12,1.69)
df = data.frame(a,b,d,e)
max(df$e)
#[1] 2.1
df
# a b d e
# 1 228 50 0.27 2.1
# 2 239 83 0.24 1.7
Now I would like to do some further calculations:
df$f = (sqrt(df[,1]/(df[,2]+0.5))/max(df$e))*100
In the end the data frame should have columns a and b without decimals, and d, e and f with two decimal places.
Thank you!
tl;dr you've probably got options(digits = 2).
df
a b d e
1 228 50 0.27 2.12
2 239 83 0.24 1.69
If I set options(digits = 2), R uses this value to format printed output globally to two significant digits; it does not change the stored values. Note that "digits" here means total significant digits, not digits after the decimal point.
options(digits = 2)
df
a b d e f
1 228 50 0.27 2.1 100
2 239 83 0.24 1.7 80
To restore the value to the default, use options(digits = 7).
After restoring the default digits setting and computing df$f = (sqrt(df[,1]/(df[,2]+0.5))/max(df$e))*100 I get
a b d e f
1 228 50 0.27 2.12 100.2273
2 239 83 0.24 1.69 79.8031
You can round the last column to two decimal places:
df$f <- round(df$f, 2)
> df
a b d e f
1 228 50 0.27 2.12 100.23
2 239 83 0.24 1.69 79.80
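If you only want to control how the numbers print, leaving the stored values untouched, a small sketch is to build a character copy of the data frame for display; sprintf("%.2f", ...) pads to exactly two decimals. The column names below simply match the example above.
df_print <- df
df_print[c("d", "e", "f")] <- lapply(df_print[c("d", "e", "f")],
                                     function(x) sprintf("%.2f", x))
df_print   # prints d, e and f with exactly two decimals; a and b are untouched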
I used the freq function from the frequency package to get frequency percentages on my dataset$MoriskyAdherence, but R gives me the percent values rounded. I need more decimal places.
MoriskyAdherence=dataset$MoriskyAdherence
freq(MoriskyAdherence)
The result is:
The Percent values are 35.0, 41.3, 23.8, which sum to 100.1.
The exact amounts should be 35.00, 41.25, 23.75.
What should I do?
I tried sprintf, as.data.frame, formatC, and some other functions to deal with it, but...
The function freq returns a character data frame and has no option to adjust the number of decimal places. However, it is easy to recreate the table with whatever formatting you want. For example, I have written this function, which gives the same result but with two decimal places instead of one:
freq2 <- function(data_frame)
{
  df <- frequency::freq(data_frame)
  lapply(df, function(x)
  {
    n <- suppressWarnings(as.numeric(x$Freq))
    sum_all <- as.numeric(x$Freq[nrow(x)])
    raw_percent <- suppressWarnings(100 * n / sum_all)
    t_row <- grep("Total", x[,2])[1]
    valid_percent <- suppressWarnings(100 * n / as.numeric(x$Freq[t_row]))
    x$Percent <- format(round(raw_percent, 2), nsmall = 2)
    x$'Valid Percent' <- format(round(valid_percent, 2), nsmall = 2)
    x$'Cumulative Percent' <- format(round(cumsum(valid_percent), 2), nsmall = 2)
    x$'Cumulative Percent'[t_row:nrow(x)] <- ""
    x$'Valid Percent'[(t_row + 1):nrow(x)] <- ""
    return(x)
  })
}
Now instead of
freq(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.0 35.0 35.0
#> 3 Low Adherence 66 41.3 41.3 76.3
#> 4 Medium Adherence 38 23.8 23.8 100.0
#> 41 Total 160 100.0 100.0
#> 1 Missing <blank> 0 0.0
#> 5 <NA> 0 0.0
#> 7 Total 160 100.0
you can do
freq2(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.00 35.00 35.00
#> 3 Low Adherence 66 41.25 41.25 76.25
#> 4 Medium Adherence 38 23.75 23.75 100.00
#> 41 Total 160 100.00 100.00
#> 1 Missing <blank> 0 0.00
#> 5 <NA> 0 0.00
#> 7 Total 160 100.00
which is exactly what you were looking for.
Two (potential) solutions:
Solution #1:
Make changes inside the function freq. You can retrieve the function's code by typing freq (without parentheses), or view the commented source at https://rdrr.io/github/wilcoxa/frequencies/src/R/freq.R.
My hunch is that to obtain more decimals, changes must be implemented at this point in the code:
# create a list of frequencies
message("Building tables")
all_freqs <- lapply_pb(names(x), function(y, x1 = as.data.frame(x), maxrow1 = maxrow, trim1 = trim){
  makefreqs(x1, y, maxrow1, trim1)
})
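If you do go down this route, a hedged sketch (assuming the package is installed and loaded as frequency) is to copy the function and modify the copy rather than the installed source, then re-point its environment so unexported helpers such as makefreqs and lapply_pb still resolve:
library(frequency)
freq                                              # prints the function's source to the console
my_freq <- freq                                   # work on a copy of the function
# edit my_freq, e.g. with fix(my_freq), changing the rounding where needed
environment(my_freq) <- asNamespace("frequency")  # so the package's internal helpers are found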
Solution #2:
If you're only after percentages with more decimals, you can use aggregate. Suppose your data has this structure: a data frame with two variables, one numeric and one a factor by which you want to group:
set.seed(123)
Var1 <- sample(LETTERS[1:4], 10, replace = T)
Var2 <- sample(10:100, 10, replace = T)
df <- data.frame(Var1, Var2)
Var1 Var2
1 B 97
2 D 51
3 B 71
4 D 62
5 D 19
6 A 91
7 C 32
8 D 13
9 C 39
10 B 96
Then, to obtain your percentages by factor, you would use aggregate like this:
aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(Var2)*100)
Var1 Var2
1 A 15.93695
2 B 46.23468
3 C 12.43433
4 D 25.39405
You can control the number of decimals by using round:
aggregate(Var2 ~ Var1, data = df, function(x) round(sum(x)/sum(Var2)*100,3))
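If you prefer a named vector instead of a data frame, a roughly equivalent base-R sketch with tapply gives the same grouped percentages:
round(100 * tapply(df$Var2, df$Var1, sum) / sum(df$Var2), 3)   # percentages by Var1, 3 decimals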
I'm performing some biogeographic analyses in R and the result is encoded as a pair of matrices. Columns represent geographic regions, rows indicate nodes in a phylogenetic tree and values in the matrix are the probability that the branching event occurred in the geographic region indicated by the column. A very simple example would be:
> One_node <- matrix(c(0,0.8,0.2,0),
+                    nrow=1, ncol=4,
+                    dimnames = list(c("node_1"),
+                                    c("A","B","C","D")))
> One_node
A B C D
node_1 0 0.8 0.2 0
In this case, the most probable location for node_1 is region B. In reality, the output of the analysis is encoded as two separate 79x123 matrices. The first is the probabilities of a node occupying a given region before an event and the second is the probabilities of a node occupying a given region after an event (rowSums=1). Some slightly more complicated examples:
before <- matrix(c(0, 0, 0, 0, 0.9,
                   0.8, 0.2, 0.6, 0.4, 0.07,
                   0.2, 0.8, 0.4, 0.6, 0.03,
                   0, 0, 0, 0, 0),
                 nrow = 5, ncol = 4,
                 dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
                                 c("A","B","C","D")))
after <- matrix(c(0, 0, 0, 0, 0.9,
                  0.2, 0.8, 0.4, 0.6, 0.03,
                  0.8, 0.2, 0.6, 0.4, 0.07,
                  0, 0, 0, 0, 0),
                nrow = 5, ncol = 4,
                dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
                                c("A","B","C","D")))
> before
A B C D
node_1 0.0 0.80 0.20 0
node_2 0.0 0.20 0.80 0
node_3 0.0 0.60 0.40 0
node_4 0.0 0.40 0.60 0
node_5 0.9 0.07 0.03 0
> after
A B C D
node_1 0.0 0.20 0.80 0
node_2 0.0 0.80 0.20 0
node_3 0.0 0.40 0.60 0
node_4 0.0 0.60 0.40 0
node_5 0.9 0.03 0.07 0
Specifically, I'm only interested in extracting the row numbers where column B is the highest in before and column C is the highest in after, and vice versa, as I'm trying to extract node numbers in a tree where taxa have moved B->C or C->B.
So the output I'm looking for would be something like:
> BC
[1] 1 3
> CB
[1] 2 4
There will be rows where B>C or C>B but where neither is the highest in the row (node_5) and I need to ignore these. The row numbers are then used to query a separate dataframe that provides the data I want.
I hope this all makes sense. Thanks in advance for any advice!
You could do something like this...
maxBefore <- apply(before, 1, which.max) #find highest columns before (by row)
maxAfter <- apply(after, 1, which.max) #and highest columns after
BC <- which(maxBefore==2 & maxAfter==3) #rows with B highest before, C after
CB <- which(maxBefore==3 & maxAfter==2) #rows with C highest before, B after
BC
node_1 node_3
1 3
CB
node_2 node_4
2 4
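If you need to check several region pairs, the same logic can be wrapped in a small helper; the name movers below is just for illustration, not part of any package:
movers <- function(before, after, from, to) {
  # rows whose most probable region is `from` before the event and `to` after it
  which(apply(before, 1, which.max) == match(from, colnames(before)) &
        apply(after,  1, which.max) == match(to,  colnames(after)))
}
movers(before, after, "B", "C")   # same rows as BC above
movers(before, after, "C", "B")   # same rows as CB above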
I have a data frame in which I would like to group rows based on whether the integer values are consecutive or not, and then find the difference between the maximum and minimum value of each group.
Example of data:
x Integers
0.1 14
0.05 15
2.7 17
0.07 19
3.4 20
0.05 21
So Group 1 would consist of 14 and 15, and Group 2 would consist of 19, 20 and 21.
The difference of each group then being 1 and 2, respectively.
I have tried the following, to first group the consecutive values, with no luck.
Breaks <- c(0, which(diff(Data$Integer) != 1), length(Data$Integer))
sapply(seq(length(Breaks) - 1),
       function(i) Data$Integer[(Breaks[i] + 1):Breaks[i + 1]])
Here's a solution using by():
df <- data.frame(x=c(0.1,0.05,2.7,0.07,3.4,0.05),Integers=c(14,15,17,19,20,21));
do.call(rbind, by(df, cumsum(c(0, diff(df$Integers) != 1)), function(g)
  data.frame(imin = min(g$Integers), imax = max(g$Integers), irange = diff(range(g$Integers)),
             xmin = min(g$x), xmax = max(g$x), xrange = diff(range(g$x)))))
## imin imax irange xmin xmax xrange
## 0 14 15 1 0.05 0.1 0.05
## 1 17 17 0 2.70 2.7 0.00
## 2 19 21 2 0.05 3.4 3.35
I wasn't sure what data you wanted in the output, so I just included everything you might want.
You can filter out the middle group with subset(...,irange!=0).
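To see how the grouping vector passed to by() is built, it helps to look at the intermediate steps on the example data:
diff(df$Integers)                      # 1 2 2 1 1
diff(df$Integers) != 1                 # FALSE TRUE TRUE FALSE FALSE
cumsum(c(0, diff(df$Integers) != 1))   # 0 0 1 2 2 2 -> one group id per run of consecutive integers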
I am looking for a quick extension to the following solution posted here. In it, Frank shows that for an example data.table
test <- data.table("index"=rep(letters[1:10],100),"var1"=rnorm(1000,0,1))
You can quickly make dummies by using the following code
inds <- unique(test$index) ; test[,(inds):=lapply(inds,function(x)index==x)]
Now I want to extend this solution for a data.table that has multiple rows of indices, e.g.
new <- data.table("id" = rep(c("Jan","James","Dirk","Harry","Cindy","Leslie","John","Frank"),125), "index1"=rep(letters[1:5],200),"index2" = rep(letters[6:15],100),"index3" = rep(letters[16:19],250))
I need to do this for many dummies and ideally the solution would allow me to get 4 things:
The total count of every index
The mean times every index occurs
The count of every index per id
The mean of every index per id
In my real case, the indices are named differently so the solution would need to be able to loop through the column names I think.
Thanks
Simon
If you only need the four items in that list, you should just tabulate:
indcols <- paste0('index',1:3)
lapply(new[,indcols,with=FALSE],table) # counts
lapply(new[,indcols,with=FALSE],function(x)prop.table(table(x))) # means
# or...
lapply(
  new[, indcols, with = FALSE],
  function(x){
    z <- table(x)
    rbind(count = z, mean = prop.table(z))
  })
This gives
$index1
a b c d e
count 200.0 200.0 200.0 200.0 200.0
mean 0.2 0.2 0.2 0.2 0.2
$index2
f g h i j k l m n o
count 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
mean 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
$index3
p q r s
count 250.00 250.00 250.00 250.00
mean 0.25 0.25 0.25 0.25
The previous approach would work on a data.frame or a data.table, but is rather complicated. With a data.table, one can use the melt syntax:
melt(new, id="id")[,.(
N=.N,
mean=.N/nrow(new)
), by=.(variable,value)]
which gives
variable value N mean
1: index1 a 200 0.20
2: index1 b 200 0.20
3: index1 c 200 0.20
4: index1 d 200 0.20
5: index1 e 200 0.20
6: index2 f 100 0.10
7: index2 g 100 0.10
8: index2 h 100 0.10
9: index2 i 100 0.10
10: index2 j 100 0.10
11: index2 k 100 0.10
12: index2 l 100 0.10
13: index2 m 100 0.10
14: index2 n 100 0.10
15: index2 o 100 0.10
16: index3 p 250 0.25
17: index3 q 250 0.25
18: index3 r 250 0.25
19: index3 s 250 0.25
This approach was mentioned by @Arun in a comment (and implemented by him also, I think?). To see how it works, first have a look at melt(new, id = "id"), which transforms the original data.table.
As mentioned in the comments, melting a data.table requires installing and loading reshape2 for some versions of the data.table package.
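As a quick sketch of that intermediate step (nothing new, just for inspection), melting stacks the three index columns into a single value column, giving 3000 rows with columns id, variable and value:
library(data.table)
long <- melt(new, id = "id")   # variable is index1/index2/index3, value is the letter
head(long)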
If you also need the dummies, they can be made in a loop as in the linked question:
newcols <- list()
for (i in indcols){
  vals <- unique(new[[i]])
  newcols[[i]] <- paste(vals, i, sep = "_")
  new[, (newcols[[i]]) := lapply(vals, function(x) get(i) == x)]
}
This stores the groups of columns associated with each variable in newcols for convenience. If you wanted to do the tabulation just with these dummies (instead of the underlying variables as in solution above), you could do
lapply(
  indcols,
  function(i) new[, lapply(.SD, function(x){
    z <- sum(x)
    list(z, z / .N)
  }), .SDcols = newcols[[i]]])
which gives a similar result. I just wrote it this way to illustrate how data.table syntax can be used. You could again avoid square brackets and .SD here:
lapply(
  indcols,
  function(i) sapply(
    new[, newcols[[i]], with = FALSE],
    function(x){
      z <- sum(x)
      rbind(z, z / length(x))
    }))
But anyway: just use table if you can hold onto the underlying variables.
Essentially I'm after the expansion of a vector against a list of vectors, where the list elements have arbitrary lengths.
dose<-c(10,20,30,40,50)
resp<-list(c(.3),c(.4,.45,.48),c(.6,.59),c(.8,.76,.78),c(.9))
I can get something pretty close with
data.frame(dose,I(resp))
but it's not quite right. I need to expand out the resp list-column, pairing its values against the dose column.
The desired format is:
10 .3
20 .4
20 .45
20 .48
30 .6
30 .59
40 .8
40 .76
40 .78
50 .9
Here is a solution using rep() and unlist():
Use rep to repeat each element of dose as many times as the corresponding element of resp is long.
Use unlist to flatten resp into a single vector.
The code:
data.frame(
dose = rep(dose, sapply(resp, length)),
resp = unlist(resp)
)
dose resp
1 10 0.30
2 20 0.40
3 20 0.45
4 20 0.48
5 30 0.60
6 30 0.59
7 40 0.80
8 40 0.76
9 40 0.78
10 50 0.90
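A small follow-up: in base R 3.2.0 and later, lengths() can replace the sapply(resp, length) call; the result is identical:
data.frame(
  dose = rep(dose, lengths(resp)),   # lengths() gives the length of each list element
  resp = unlist(resp)
)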