Maximum based on custom order - r

I want to calculate the highest price based on a custom unit order, as follows (largest first):
Hundred Million
Million
THO
Hundred
The data is as below:
df <- read.table(text = "Price Unit
1445 Million
620 THO
830 Million
661 Million
783 Hundred
349 'Hundred Million'
", header= T)

If you also wish to calculate the "actual price", we can:
first create a data frame of "Unit" and "Value" (for example price_unit in my answer);
then left_join this price_unit with your original data frame, which will match on the "Unit" column;
then do the calculation using mutate;
finally, sort by the computed column.
library(tidyverse)
df <- read.table(text = "Price Unit
1445 Million
620 THO
830 Million
661 Million
783 Hundred
349 'Hundred Million'
", header= T)
price_unit <- tibble(Unit = c("THO", "Hundred", "Million", "Hundred Million"),
Value = c(10^3, 10^2, 10^6, 10^8))
left_join(df, price_unit, by = "Unit") %>%
mutate(actual_price = Price * Value) %>%
arrange(desc(actual_price))
Price Unit Value actual_price
1 349 Hundred Million 1e+08 3.490e+10
2 1445 Million 1e+06 1.445e+09
3 830 Million 1e+06 8.300e+08
4 661 Million 1e+06 6.610e+08
5 620 THO 1e+03 6.200e+05
6 783 Hundred 1e+02 7.830e+04

First, you can create a factor from your Unit variable, ordering it via the levels argument:
df$Unit <- factor(df$Unit,
                  levels = c("THO",
                             "Hundred",
                             "Million",
                             "Hundred Million"))
Then just arrange by Unit, which orders the rows from smallest unit to largest:
df %>%
  arrange(Unit,
          Price)
Which gives you this output:
Price Unit
1 620 THO
2 783 Hundred
3 661 Million
4 830 Million
5 1445 Million
6 349 Hundred Million
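Since Unit is now a factor whose levels run from smallest to largest, a small extra sketch (not part of the original answer): reversing the sort gives the largest units first, as the question asked.
df %>%
  arrange(desc(Unit),
          desc(Price))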

Related

dplyr sample_n returns different number of rows in table

I am working with dplyr and sample_n in R, trying to get evenly sized groups of rows to work with in my data frame.
So, I have a data set, head of data as follows:
> head(SEH)
Time.Level Demo.Age SEH.Total
92 PRE 12 110
335 PRE 12 80
720 MID 14 85
196 MID 11 95
408 POST 18 60
184 POST 10 99
I separated the data into three different data frames according to time level, so I have SEH.pre, SEH.mid and SEH.post. Running describe on each shows I have uneven groups of pre, mid and post, so I want to randomly sample the pre, mid and post groups down to an even size. For example, the SEH.pre and SEH.mid group n sizes are below:
> describe(SEH.pre)
vars n
Time.Level* 1 887
Demo.Age 2 883
SEH.Total 3 887
> describe(SEH.mid)
vars n
Time.Level* 1 894
Demo.Age 2 872
SEH.Total 3 894
So now I run sample_n on SEH.pre, thinking that I can re-sample to an n of 860 across all columns. I run the following command:
SEH.pre2 <- sample_n(SEH.pre, 860, replace = FALSE)
And then when I describe it, Demo.Age has a smaller n than the rest:
> describe(SEH.pre2)
vars n ...
Time.Level* 1 860
Demo.Age 2 856
SEH.Total 3 860
I feel like a big idiot, but I cannot figure out why this is. I have tried it multiple times and Demo.Age varies from 856 to 859, but is never 860. I want all three columns to be 860. How do I do this? And why am I wrong to think that sample_n should create even groups out of uneven ones?
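A likely explanation, assuming describe here is psych::describe: it reports n as the count of non-missing values per column, and SEH.pre already has missing ages (887 rows but only 883 Demo.Age values above). sample_n samples whole rows, so any sampled row with a missing Demo.Age lowers that column's n. A sketch of one way to get 860 complete rows, assuming tidyr is available for drop_na:
library(dplyr)
library(tidyr)
SEH.pre2 <- SEH.pre %>%
  drop_na(Demo.Age) %>%             # keep only rows with a recorded age
  sample_n(860, replace = FALSE)    # then sample 860 complete rows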

Loop only iterates first 4 rows

My for loop only iterates over the first 4 rows of the R data frame. I read several similar posts and tried the suggested approaches, but none worked. Any help is appreciated.
df_total <- list()
for (i in 1:length(df_test)) {
  df <- recover(df_test[i, ], "PI", 1)
  df$i <- i
  df_total[[i]] <- df
}
big_data <- do.call(rbind, df_total)
row_1 row_2 correct incorrect newrow1 newrow2
56245270 8549 9949 71 3 8550 9950
9332380 896 9949 71 1 897 9950
14783792 1460 4943 70 2 1461 4944
41437670 4943 10388 70 0 4944 10389
9323891 896 1460 70 2 897 1461
Note that length(df) gives you the number of columns of a data.frame. If you want the number of rows, use nrow(df).
Ideally you would use
seq(nrow(df))
to generate an index for a for loop, looping over the rows of a data.frame.
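A sketch of the corrected loop (recover() here is the asker's own function, not base R's debugging helper):
df_total <- vector("list", nrow(df_test))
for (i in seq_len(nrow(df_test))) {      # iterate over rows, not columns
  df <- recover(df_test[i, ], "PI", 1)
  df$i <- i                              # record which row produced this result
  df_total[[i]] <- df
}
big_data <- do.call(rbind, df_total)     # combine the per-row results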

Adding values of two columns on the same row to get a new value

Sorry for asking a very basic question, but I am new to R and really stuck on a rather simple matter; I have the data frame below (two rows shown, with a Sub column plus seven condition columns):
Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166
These values are time durations (secs) for seven test conditions, whose column names are
c("sup_b", "hdt", "sup_2", "lbnp", "sup_3", "hut", "sup_4")
and there are 17 rows (each row is one study subject; I have only included the first two).
I am trying to add the value in row 1, column sup_b (175) to the value in row 1, column hdt (434), to get the combined duration of the first two conditions, i.e. 609 secs. I then add that running total (609) to the next column, sup_2 (609 + 596), and so on until the last condition, sup_4.
I have tried the method below, which is for subject 6 (row 1) and works fine, but I want to tidy this up and make it easier, as I have 17 subjects (rows) and have been advised there is an easier way:
sup_b <- 175
hdt <- (sup_b + 434)
sup_2 <- (hdt + 596)
lbnp <- (sup_2 + 585)
sup_3 <- (lbnp + 601)
hut <- (sup_3 + 593)
sup_4 <- (hut + 211)
I want to be able to just change the row number and have the data pulled from the data frame, rather than entering each time period by hand; for instance:
line <- 1 ### the row I want which corresponds to the subject
sup_b <- df[line, 2]
hdt <- df[line, 2] + df[line, 3]
but I keep getting this warning message:
In Ops.factor(df[line, 2], df[line, 3]) : ‘+’ not meaningful for factor
I have even tried: colSums(df[,c(2:3)]), but get the following error:
Error in colSums(df[, c(2:3)]) : 'x' must be numeric.
also tried: st$sum <- apply(df[,c(2:3)], 1, sum), which doesn't work either.
df1[-1] <- t(apply(df1[-1],1,cumsum))
# Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
# 1 6 175 609 1205 1790 2391 2984 3195
# 2 7 130 722 1314 1907 2507 2891 3057
data
df1 <- read.table(text = "Sub sup_b hdt sup_2 lbnp sup_3 hut sup_4
6 175 434 596 585 601 593 211
7 130 592 592 593 600 384 166", header = TRUE, stringsAsFactors = FALSE)
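Two side notes, not part of the original answer: the "'+' not meaningful for factors" and "'x' must be numeric" errors in the question mean those columns are factors rather than numeric; as.numeric(as.character(x)) converts them, and it is worth checking why they were not read as numbers in the first place. As for the one-liner above, apply(df1[-1], 1, cumsum) returns its row-wise results as columns, which is why the result is transposed back with t():
m <- apply(df1[-1], 1, cumsum)   # 7 x 2 matrix: one column per subject
df1[-1] <- t(m)                  # transpose so each subject is a row again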

R One sample test for set of columns for each row

I have a data set with the Levels and Trends for, say, 50 cities under 3 scenarios. Below is the sample data:
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios, I find the minimum Level and minimum Trend using the code below:
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check whether these minimum values are significantly different from the levels and trends respectively, i.e. compare L_min against columns 2-4 and T_min against columns 5-7. This needs to be done for each city (row), and if the difference is significant, return which column it differs from.
It would help if someone could suggest how this can be done.
Thank you!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715 176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980 264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891 333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058 405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423 100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841 374 0.57909775
> df$FC <- rowMeans(df[,2:4])/df[,8]
> df <- df[order(-df$FC), ]
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121 117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542 117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544 101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484 132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217 110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041 182 0.25194969 2.117216
Now you have the rows that differ most, based on columns 2:4, at the top. Columns 5:7 can be handled analogously.
And some tips for statistical tests:
Prefer the t-test (parametric, based on the mean) over the Wilcoxon / Mann-Whitney U test (non-parametric, based on the median), since it has more power; HOWEVER:
- Data sets should be big. Example hypothesis: Montreal has taller citizens than Quebec; t.test will work fine when you take 100 people from each city, so we have height measurements for 200 people, 100 vs 100.
- The distribution should be close to normal in all samples, or both samples should have a similar non-normal distribution (it may, for example, be binomial). Either way, we can't use this test when one sample has a normal distribution and the other doesn't.
- The sizes of both samples should be equal, so 100 vs 100 is fine, but 87 vs 234 is not ideal; the test will still return a p-value, but it may be misleading.
If your data doesn't meet the above conditions, I prefer a non-parametric test: less power, but more robust.
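As a sketch of the mechanics the question asks about, here is a per-row one-sample t-test of each row's three levels (and trends) against that row's minimum; with only three values per row the power is very low, so treat this as illustration only:
# hypothetical per-row tests: are the three values consistent with mu = row minimum?
df$p_L <- apply(df[, 2:4], 1, function(x) t.test(x, mu = min(x))$p.value)
df$p_T <- apply(df[, 5:7], 1, function(x) t.test(x, mu = min(x))$p.value)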

R: looping through data.frame columns

I have the following my_data:
geneid chr acc_no start end size strand S1 S2 A1 A2
1 gene_010010 1 AC12345.1 3662 4663 1002 - 328 336 757 874
2 gene_010020 1 AC12345.1 5750 7411 1662 - 480 589 793 765
3 gene_010030 2 AC12345.1 9003 11024 2022 - 653 673 875 920
4 gene_010040 2 AC12345.1 12006 12566 561 - 573 623 483 430
5 gene_010050 3 AC12345.1 15035 17032 1998 - 2256 2333 1866 1944
6 gene_010060 3 AC12345.1 18188 18937 750 - 526 642 650 586
I am able to calculate sums for a given column, e.g.:
chr.sums <- data.frame(with(my_data, tapply(S1, INDEX = chr, FUN = sum)))
The problem is, I want chr.sums to have four columns (S1, S2, A1 and A2) and 30 rows corresponding to the unique chr numbers. I do not want to switch back and forth to Python, but looping through columns and assigning the output to specific columns of a data.frame baffles me.
EDIT
Toy data set above.
You can use ddply from plyr. Here is some code:
plyr::ddply(my_data, .(chr), summarize, S1 = sum(S1), S2 = sum(S2),
            A1 = sum(A1), A2 = sum(A2))
EDIT. A more compact solution would be:
plyr::ddply(my_data, .(chr), colwise(sum, .(S1, S2, A1, A2)))
Here is how it works. The data is first split into pieces based on chr. Then the columns S1, S2, A1, A2 are summed up for each piece. Finally, the pieces are assembled back into a single data frame.
Any place you have this kind of split-apply-combine problem, think of plyr as a solution.
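For comparison, the same split-apply-combine can also be written with dplyr (a sketch, not from the original answers; across() needs dplyr 1.0 or later):
library(dplyr)
my_data %>%
  group_by(chr) %>%
  summarise(across(c(S1, S2, A1, A2), sum))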
tapply won't handle multiple columns but the formula version of aggregate will.
chr.sums <- aggregate(cbind(S1, S2, A1, A2) ~ chr, data = my_data, FUN = sum)
