*I wanted to arrange the column "TotalConfirmedCases" in descending order but it sorted in a weird way like 965 is arranged first.
CODE in R: new_Cor_table[rev(order(new_Cor_table$TotalConfirmedCases)),]
Output:
Update: thanks to input from @onyambu:
We could use order with decreasing=TRUE:
# order(..., decreasing = TRUE) sorts descending in one step; note that for
# tied values this can differ from rev(order(...)), which also reverses the
# relative order of the tied rows.
newdata <- df[order(df$TotalConfirmedCases, decreasing = TRUE),]
OR
If we want to do it with rev then here is the syntax:
# rev() flips the ascending permutation returned by order(); same descending
# result overall, but tied rows come out in reversed relative order.
newdata <- df[rev(order(df$TotalConfirmedCases)),]
newdata
County TotalConfirmedCases Totalprobablecases Totalcases Totaldeaths
3 Dakota 95277 23,252 118,529 792
7 Anoka 83623 20,459 104,082 808
26 Washington 57910 14,193 72,103 490
30 Stearns 50672 2,622 53,294 372
34 Olmsted 44718 1,048 45,766 191
36 St. Louis 43103 8,153 51,256 541
2 Douglas 9534 1,962 11,496 118
5 Isanti 8892 1,645 10,537 119
4 Morrison 8892 616 9,508 105
6 Freeborn 8753 679 9,432 77
8 Nicollet 8244 385 8,629 66
9 Becker 7877 1,292 9,169 95
11 Polk 7319 1,852 9,171 109
12 Carlton 7203 2,451 9,654 100
13 Mille Lacs 6962 578 7,540 116
15 Cass 6687 668 7,355 83
16 Todd 6605 486 7,091 61
17 Lyon 6503 759 7,262 74
18 Brown 6460 330 6,790 81
19 Le Sueur 6294 449 6,743 51
21 Pine 6141 1,319 7,460 68
22 Nobles 6025 1,044 7,069 60
23 Dodge 5916 144 6,060 22
24 Meeker 5803 361 6,164 75
25 Wabasha 5795 172 5,967 19
28 Waseca 5314 424 5,738 39
29 Martin 5273 549 5,822 65
31 Fillmore 4953 117 5,070 24
32 Hubbard 4579 556 5,135 60
33 Houston 4498 320 4,818 20
35 Roseau 4327 281 4,608 45
37 Faribault 3759 213 3,972 54
38 Redwood 3661 417 4,078 54
39 Wadena 3636 754 4,390 56
1 Kittson 965 109 1,074 28
10 Lake\tof the Woods 771 34 805 6
14 Red Lake 692 269 961 13
20 Cook 620 12 632 4
27 Traverse 577 313 890 10
>
data:
# County-level COVID case counts (dput output of the example data).
# NOTE(review): TotalConfirmedCases and Totaldeaths are integer, but
# Totalprobablecases and Totalcases are character (comma-formatted numbers),
# so any sort on those two columns is lexicographic, not numeric.
structure(list(County = c("Kittson", "Douglas", "Dakota", "Morrison",
"Isanti", "Freeborn", "Anoka", "Nicollet", "Becker", "Lake\tof the Woods",
"Polk", "Carlton", "Mille Lacs", "Red Lake", "Cass", "Todd",
"Lyon", "Brown", "Le Sueur", "Cook", "Pine", "Nobles", "Dodge",
"Meeker", "Wabasha", "Washington", "Traverse", "Waseca", "Martin",
"Stearns", "Fillmore", "Hubbard", "Houston", "Olmsted", "Roseau",
"St. Louis", "Faribault", "Redwood", "Wadena"), TotalConfirmedCases = c(965L,
9534L, 95277L, 8892L, 8892L, 8753L, 83623L, 8244L, 7877L, 771L,
7319L, 7203L, 6962L, 692L, 6687L, 6605L, 6503L, 6460L, 6294L,
620L, 6141L, 6025L, 5916L, 5803L, 5795L, 57910L, 577L, 5314L,
5273L, 50672L, 4953L, 4579L, 4498L, 44718L, 4327L, 43103L, 3759L,
3661L, 3636L), Totalprobablecases = c("109", "1,962", "23,252",
"616", "1,645", "679", "20,459", "385", "1,292", "34", "1,852",
"2,451", "578", "269", "668", "486", "759", "330", "449", "12",
"1,319", "1,044", "144", "361", "172", "14,193", "313", "424",
"549", "2,622", "117", "556", "320", "1,048", "281", "8,153",
"213", "417", "754"), Totalcases = c("1,074", "11,496", "118,529",
"9,508", "10,537", "9,432", "104,082", "8,629", "9,169", "805",
"9,171", "9,654", "7,540", "961", "7,355", "7,091", "7,262",
"6,790", "6,743", "632", "7,460", "7,069", "6,060", "6,164",
"5,967", "72,103", "890", "5,738", "5,822", "53,294", "5,070",
"5,135", "4,818", "45,766", "4,608", "51,256", "3,972", "4,078",
"4,390"), Totaldeaths = c(28L, 118L, 792L, 105L, 119L, 77L, 808L,
66L, 95L, 6L, 109L, 100L, 116L, 13L, 83L, 61L, 74L, 81L, 51L,
4L, 68L, 60L, 22L, 75L, 19L, 490L, 10L, 39L, 65L, 372L, 24L,
60L, 20L, 191L, 45L, 541L, 54L, 54L, 56L)), class = "data.frame", row.names = c(NA,
-39L))
I suggest using the rank function, with a negative sign it will reverse the order
# For a numeric column, order(-x) already yields the same permutation as
# order(-rank(x)) -- rank() just adds a redundant O(n log n) pass.
# Negating the values sorts in descending order.
new_Cor_table[order(-new_Cor_table$TotalConfirmedCases), ]
Related
My data:
# c5: comorbidity counts for two vascular-access groups.
#   AVF_Y / AVG_Y   -- number of "yes" responses per group
#   AVF_tot/AVG_tot -- group sizes (denominators for prop.test)
c5 =structure(list(comorbid = c("heart", "ihd", "cabg", "angio",
"cerebrovasc", "diabetes", "pvd", "amputation", "liver", "malig",
"smoke", "ulcers"), AVF_Y = c(626L, 355L, 266L, 92L, 320L, 1175L,
199L, 89L, 75L, 450L, 901L, 114L), AVG_Y = c(54L, 14L, 18L, 5L,
21L, 37L, 5L, 7L, 5L, 29L, 33L, 3L), AVF_tot = c(2755L, 1768L,
2770L, 2831L, 2844L, 2877L, 1745L, 2823L, 2831L, 2823L, 2798L,
2829L), AVG_tot = c(161L, 61L, 161L, 165L, 166L, 167L, 61L, 165L,
165L, 165L, 159L, 164L)), row.names = c(NA, -12L), class = "data.frame")
I want to perform a prop.test for each row ( a two-proportions z-test) and add the p value as a new column.
I've tried using the following code, but this gives me 24 1-sample proportions test results instead of 12 2-sample test for equality of proportions.
# Incorrect: c(c5$AVF_Y, c5$AVG_Y) flattens both columns into one length-24
# vector, so Map() performs 24 one-sample prop.test calls instead of pairing
# each row's two group counts into a single two-sample test.
Map(prop.test, x = c(c5$AVF_Y, c5$AVG_Y), n = c(c5$AVF_tot, c5$AVG_tot))
Use a lambda function and extract the p-value. When we concatenate the columns, we get a single vector whose length is twice the number of rows of the data. Instead, we need to concatenate within the loop, creating a length-2 vector for each x and n from the corresponding '_Y' and '_tot' columns.
# Row-wise two-sample proportion test: pair each comorbidity's "yes" counts
# with the matching group totals and keep only the p-value.
two_prop_pval <- function(yes_avf, yes_avg, n_avf, n_avg) {
  prop.test(c(yes_avf, yes_avg), c(n_avf, n_avg))$p.value
}
mapply(two_prop_pval, c5$AVF_Y, c5$AVG_Y, c5$AVF_tot, c5$AVG_tot)
-output
[1] 2.218376e-03 6.985883e-01 6.026012e-01 1.000000e+00 6.695440e-01 2.425781e-06 5.672322e-01 5.861097e-01 9.627050e-01 6.546286e-01 3.360300e-03 2.276857e-01
Or use do.call with Map or mapply
# do.call() spreads the four unnamed count/total columns of c5 as the
# vectorized arguments of mapply(); unname() prevents the column names from
# being partially matched against mapply()'s own parameters.
do.call(mapply, c(FUN = function(x, y, n1, n2)
prop.test(c(x, y), c(n1, n2))$p.value, unname(c5[-1])))
[1] 2.218376e-03 6.985883e-01 6.026012e-01 1.000000e+00 6.695440e-01 2.425781e-06 5.672322e-01 5.861097e-01 9.627050e-01 6.546286e-01 3.360300e-03 2.276857e-01
Or with apply
# apply() first coerces c5[-1] to a matrix -- safe here because all four
# remaining columns are integer. Each row x is c(AVF_Y, AVG_Y, AVF_tot, AVG_tot).
apply(c5[-1], 1, function(x) prop.test(x[1:2], x[3:4])$p.value)
[1] 2.218376e-03 6.985883e-01 6.026012e-01 1.000000e+00 6.695440e-01 2.425781e-06 5.672322e-01 5.861097e-01 9.627050e-01 6.546286e-01 3.360300e-03 2.276857e-01
Or use rowwise
library(dplyr)
# rowwise() makes mutate() evaluate prop.test() once per row; ungroup()
# drops the rowwise grouping from the result afterwards.
c5 %>%
rowwise %>%
mutate(pval = prop.test(c(AVF_Y, AVG_Y),
n = c(AVF_tot, AVG_tot))$p.value) %>%
ungroup
-output
# A tibble: 12 × 6
comorbid AVF_Y AVG_Y AVF_tot AVG_tot pval
<chr> <int> <int> <int> <int> <dbl>
1 heart 626 54 2755 161 0.00222
2 ihd 355 14 1768 61 0.699
3 cabg 266 18 2770 161 0.603
4 angio 92 5 2831 165 1.00
5 cerebrovasc 320 21 2844 166 0.670
6 diabetes 1175 37 2877 167 0.00000243
7 pvd 199 5 1745 61 0.567
8 amputation 89 7 2823 165 0.586
9 liver 75 5 2831 165 0.963
10 malig 450 29 2823 165 0.655
11 smoke 901 33 2798 159 0.00336
12 ulcers 114 3 2829 164 0.228
I have a data frame from a Bici service, it looks like this, where Origen_Id is the station's number, and Num_Viaje_Ori is the total number of trips that start in that station.
Origen_Id
Num_F
Num_M
Num_Viaje_Ori
Destino_Id
Num_F_d
Num_M_d
Num_Viaje_Des
11
1616
3973
5589
11
139
5 3855
5250
34
962
3232
4194
34
1340
4236
5576
35
1321
3993
5314
35
1418
4239
5657
50
1797
4293
6090
50
1785
4314
6099
51
1891
5186
7077
51
3084
7771
10855
52
1379
4320
5699
52
1299
3913
5212
54
1275
3950
5225
54
1373
4046
5419
75
1332
2939
4271
75
1202
2763
3965
194
1346
3792
5138
194
632
1845
2477
271
1511
3640
5151
271
1483
3750
5233
When I run
# Origen_Id is numeric here, so ggplot treats x as a continuous axis: each
# bar sits at its numeric position (11 ... 271), producing the wide gaps.
s<-ggplot(most, aes(x=Origen_Id, y=Num_Viaje_Ori))+geom_bar(stat="identity")
I got
How can I fix it? I mean, how can I make the bars get closer together?
Implementing the commented suggestions, you should get:
library(tidyverse)
library(tibble)
library(ggthemes)
# most: trips per station (tribble transcription of the posted table).
# NOTE(review): the "5 3855" entry below forces the whole Num_M_d column to
# character; the plotted column Num_Viaje_Ori is unaffected.
most <-
tibble::tribble(
~Origen_Id, ~Num_F, ~Num_M, ~Num_Viaje_Ori, ~Destino_Id, ~Num_F_d, ~Num_M_d, ~Num_Viaje_Des,
11L, 1616L, 3973L, 5589L, 11L, 139L, "5 3855", 5250L,
34L, 962L, 3232L, 4194L, 34L, 1340L, "4236", 5576L,
35L, 1321L, 3993L, 5314L, 35L, 1418L, "4239", 5657L,
50L, 1797L, 4293L, 6090L, 50L, 1785L, "4314", 6099L,
51L, 1891L, 5186L, 7077L, 51L, 3084L, "7771", 10855L,
52L, 1379L, 4320L, 5699L, 52L, 1299L, "3913", 5212L,
54L, 1275L, 3950L, 5225L, 54L, 1373L, "4046", 5419L,
75L, 1332L, 2939L, 4271L, 75L, 1202L, "2763", 3965L,
194L, 1346L, 3792L, 5138L, 194L, 632L, "1845", 2477L,
271L, 1511L, 3640L, 5151L, 271L, 1483L, "3750", 5233L
)
# Convert the station id to a factor so ggplot uses a discrete x axis
# (one adjacent bar per station instead of bars spread along 11..271).
plot_data <- mutate(most, Origen_Id = as.factor(Origen_Id))
ggplot(plot_data, aes(x = Origen_Id, y = Num_Viaje_Ori)) +
  geom_col(fill = "darkslateblue") +
  ggthemes::theme_economist_white()
Created on 2021-11-23 by the reprex package (v2.0.1)
I have a dataframe structure that calculates the sum of Response.Status found per month with this mutate function:
# Builds a Response.Status x Month count table from complete_df:
#  - Month is derived from `date` as "mm/YYYY"
#  - rows with UNSUBSCRIBE == "TRUE" contribute an extra "UNSUBSCRIBE" status
#  - pivot_longer + drop_na merges the two status columns into one,
#    count() tallies per Month/status, pivot_wider spreads months to columns.
# NOTE(review): assumes `date` is "Y/m/d" text and UNSUBSCRIBE is the string
# "TRUE"/"FALSE" -- confirm against complete_df.
DF1 <- complete_df %>%
mutate(Month = format(as.Date(date, format = "%Y/%m/%d"), "%m/%Y"),
UNSUBSCRIBE = if_else(UNSUBSCRIBE == "TRUE", "UNSUBSCRIBE", NA_character_)) %>%
pivot_longer(c(Response.Status, UNSUBSCRIBE), values_to = "Response.Status") %>%
drop_na() %>%
count(Month, Response.Status) %>%
pivot_wider(names_from = Month, names_sep = "/", values_from = n)
# A tibble: 7 x 16
Response.Status `01/2020` `02/2020` `03/2020` `04/2020` `05/2020` `06/2020` `07/2020` `08/2020` `09/2019` `09/2020` `10/2019` `10/2020` `11/2019` `11/2020` `12/2019`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 EMAIL_OPENED 1068 3105 4063 4976 2079 1856 4249 3638 882 4140 865 2573 1167 684 862
2 NOT_RESPONDED 3187 9715 13164 15239 5458 4773 12679 10709 2798 15066 2814 8068 3641 1931 2647
3 PARTIALLY_SAVED 5 34 56 8 28 22 73 86 11 14 7 23 8 8 2
4 SUBMITTED 216 557 838 828 357 310 654 621 214 1001 233 497 264 122 194
5 SURVEY_OPENED 164 395 597 1016 245 212 513 625 110 588 123 349 202 94 120
6 UNDELIVERED_OR_BOUNCED 92 280 318 260 109 127 319 321 63 445 69 192 93 39 74
7 UNSUBSCRIBE 397 1011 1472 1568 727 737 1745 2189 372 1451 378 941 429 254 355
What I would like to do is take those values created in table to calculate average based on # of people in each Response.Status group.
# Monthly counts per Response.Status (dput of the table shown above,
# minus the UNSUBSCRIBE row).
structure(list(Response.Status = c("EMAIL_OPENED", "NOT_RESPONDED",
"PARTIALLY_SAVED", "SUBMITTED", "SURVEY_OPENED", "UNDELIVERED_OR_BOUNCED"
), `01/2020` = c(1068L, 3187L, 5L, 216L, 164L, 92L), `02/2020` = c(3105L,
9715L, 34L, 557L, 395L, 280L), `03/2020` = c(4063L, 13164L, 56L,
838L, 597L, 318L), `04/2020` = c(4976L, 15239L, 8L, 828L, 1016L,
260L), `05/2020` = c(2079L, 5458L, 28L, 357L, 245L, 109L), `06/2020` = c(1856L,
4773L, 22L, 310L, 212L, 127L), `07/2020` = c(4249L, 12679L, 73L,
654L, 513L, 319L), `08/2020` = c(3638L, 10709L, 86L, 621L, 625L,
321L), `09/2019` = c(882L, 2798L, 11L, 214L, 110L, 63L), `09/2020` = c(4140L,
15066L, 14L, 1001L, 588L, 445L), `10/2019` = c(865L, 2814L, 7L,
233L, 123L, 69L), `10/2020` = c(2573L, 8068L, 23L, 497L, 349L,
192L), `11/2019` = c(1167L, 3641L, 8L, 264L, 202L, 93L), `11/2020` = c(684L,
1931L, 8L, 122L, 94L, 39L), `12/2019` = c(862L, 2647L, 2L, 194L,
120L, 74L)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I made a separate table that contains sum values based on those group names:
Response.Status
EMAIL_OPENED : 451
NOT_RESPONDED : 1563
PARTIALLY_SAVED : 4
SUBMITTED : 71
SURVEY_OPENED : 53
UNDELIVERED_OR_BOUNCED: 47
UNSUBSCRIBE: 135
If I understood your problem correctly, you have 2 data.frames/tibbles: one shown in the "structure" part, and one that gives the quantity of people/users per response status. Now you want to get the value per person. If so, this is a possible solution:
# people/users data set
# people/users data set: group size per response status
df2 <- data.frame(
  Response.Status = c("EMAIL_OPENED", "NOT_RESPONDED", "PARTIALLY_SAVED",
                      "SUBMITTED", "SURVEY_OPENED", "UNDELIVERED_OR_BOUNCED",
                      "UNSUBSCRIBE"),
  PEOPLE = c(451, 1563, 4, 71, 53, 47, 135)
)
# Total events per status across all months, then the per-person mean.
df %>% # this is your "structure"
  tidyr::pivot_longer(-Response.Status, names_to = "DATE", values_to = "nmbr") %>%
  dplyr::group_by(Response.Status) %>%
  dplyr::summarise(SUM = sum(nmbr)) %>%
  # join explicitly on the status column instead of relying on the
  # "Joining, by = ..." guess an unkeyed inner_join() makes at runtime
  dplyr::inner_join(df2, by = "Response.Status") %>%
  dplyr::mutate(MEAN_PP = SUM / PEOPLE)
Response.Status SUM PEOPLE MEAN_PP
<chr> <int> <dbl> <dbl>
1 EMAIL_OPENED 36207 451 80.3
2 NOT_RESPONDED 111889 1563 71.6
3 PARTIALLY_SAVED 385 4 96.2
4 SUBMITTED 6906 71 97.3
5 SURVEY_OPENED 5353 53 101
6 UNDELIVERED_OR_BOUNCED 2801 47 59.6
I have the following data frame (df) of 29 observations of 5 variables:
age height_seca1 height_chad1 height_DL weight_alog1
1 19 1800 1797 180 70
2 19 1682 1670 167 69
3 21 1765 1765 178 80
4 21 1829 1833 181 74
5 21 1706 1705 170 103
6 18 1607 1606 160 76
7 19 1578 1576 156 50
8 19 1577 1575 156 61
9 21 1666 1665 166 52
10 17 1710 1716 172 65
11 28 1616 1619 161 66
12 22 1648 1644 165 58
13 19 1569 1570 155 55
14 19 1779 1777 177 55
15 18 1773 1772 179 70
16 18 1816 1809 181 81
17 19 1766 1765 178 77
18 19 1745 1741 174 76
19 18 1716 1714 170 71
20 21 1785 1783 179 64
21 19 1850 1854 185 71
22 31 1875 1880 188 95
23 26 1877 1877 186 106
24 19 1836 1837 185 100
25 18 1825 1823 182 85
26 19 1755 1754 174 79
27 26 1658 1658 165 69
28 20 1816 1818 183 84
29 18 1755 1755 175 67
I wish to obtain the mean, standard deviation, median, minimum, maximum and sample size of each of the variables and get an output as a data frame. I tried using the code below, but then it becomes impossible for me to work with, and using tapply or aggregate seems to be beyond me as a novice R programmer. My assignment requires me not to use any 'extra' R packages.
apply(df, 2, mean)
apply(df, 2, sd)
apply(df, 2, median)
apply(df, 2, min)
apply(df, 2, max)
apply(df, 2, length)
Ideally, this is how the output data frame should look like including the row headings for each of the statistical functions:
age height_seca1 height_chad1 height_DL weight_alog1
mean 20 1737 1736 173 73
sd 3.3 91.9 92.7 9.7 14.5
median 19 1755 1755 175 71
minimum 17 1569 1570 155 50
maximum 31 1877 1880 188 106
sample size 29 29 29 29 29
Any help would be greatly appreciated.
Try with basicStats from fBasics package
> install.packages("fBasics")
> library(fBasics)
> basicStats(df)
age height_seca1 height_chad1 height_DL weight_alog1
nobs 29.000000 29.000000 29.000000 29.000000 29.000000
NAs 0.000000 0.000000 0.000000 0.000000 0.000000
Minimum 17.000000 1569.000000 1570.000000 155.000000 50.000000
Maximum 31.000000 1877.000000 1880.000000 188.000000 106.000000
1. Quartile 19.000000 1666.000000 1665.000000 166.000000 65.000000
3. Quartile 21.000000 1816.000000 1809.000000 181.000000 80.000000
Mean 20.413793 1737.241379 1736.482759 173.379310 73.413793
Median 19.000000 1755.000000 1755.000000 175.000000 71.000000
Sum 592.000000 50380.000000 50358.000000 5028.000000 2129.000000
SE Mean 0.612910 17.069018 17.210707 1.798613 2.700354
LCL Mean 19.158305 1702.277081 1701.228224 169.695018 67.882368
UCL Mean 21.669282 1772.205677 1771.737293 177.063602 78.945219
Variance 10.894089 8449.189655 8590.044335 93.815271 211.465517
Stdev 3.300619 91.919474 92.682492 9.685828 14.541854
Skewness 1.746597 -0.355499 -0.322915 -0.430019 0.560360
Kurtosis 2.290686 -1.077820 -1.086108 -1.040182 -0.311017
You can also subset the output to get what you want:
> basicStats(df)[c("Mean", "Stdev", "Median", "Minimum", "Maximum", "nobs"),]
age height_seca1 height_chad1 height_DL weight_alog1
Mean 20.413793 1737.24138 1736.48276 173.379310 73.41379
Stdev 3.300619 91.91947 92.68249 9.685828 14.54185
Median 19.000000 1755.00000 1755.00000 175.000000 71.00000
Minimum 17.000000 1569.00000 1570.00000 155.000000 50.00000
Maximum 31.000000 1877.00000 1880.00000 188.000000 106.00000
nobs 29.000000 29.00000 29.00000 29.000000 29.00000
Another alternative is that you define your own function as in this post.
Update:
(I hadn't read the "My assignment requires me not use any 'extra' R packages." part)
As I said before, you can define your own function and loop over each column by using *apply family functions:
# Six-number summary of a numeric vector as a named numeric vector
# (mean, sd, median, min, max, n). Extra arguments (e.g. na.rm = TRUE)
# are forwarded to every statistic except length().
my.summary <- function(x, ...) {
  stat_funs <- list(
    mean   = mean,
    sd     = sd,
    median = median,
    min    = min,
    max    = max
  )
  stats <- vapply(stat_funs, function(f) f(x, ...), FUN.VALUE = numeric(1))
  c(stats, n = length(x))
}
# all these calls should give you the same results.
# (apply() first coerces df to a matrix -- fine here because every column is
# numeric; sapply()/lapply() pass the columns through unchanged.)
apply(df, 2, my.summary)
sapply(df, my.summary)
do.call(cbind,lapply(df, my.summary))
Or using what you have already done, you just need to put those summaries into a list and use do.call
# df: the 29 observations of 5 integer variables from the question (dput output).
df <- structure(list(age = c(19L, 19L, 21L, 21L, 21L, 18L, 19L, 19L, 21L, 17L, 28L, 22L, 19L, 19L, 18L, 18L, 19L, 19L, 18L, 21L, 19L, 31L, 26L, 19L, 18L, 19L, 26L, 20L, 18L), height_seca1 = c(1800L, 1682L, 1765L, 1829L, 1706L, 1607L, 1578L, 1577L, 1666L, 1710L, 1616L, 1648L, 1569L, 1779L, 1773L, 1816L, 1766L, 1745L, 1716L, 1785L, 1850L, 1875L, 1877L, 1836L, 1825L, 1755L, 1658L, 1816L, 1755L), height_chad1 = c(1797L, 1670L, 1765L, 1833L, 1705L, 1606L, 1576L, 1575L, 1665L, 1716L, 1619L, 1644L, 1570L, 1777L, 1772L, 1809L, 1765L, 1741L, 1714L, 1783L, 1854L, 1880L, 1877L, 1837L, 1823L, 1754L, 1658L, 1818L, 1755L), height_DL = c(180L, 167L, 178L, 181L, 170L, 160L, 156L, 156L, 166L, 172L, 161L, 165L, 155L, 177L, 179L, 181L, 178L, 174L, 170L, 179L, 185L, 188L, 186L, 185L, 182L, 174L, 165L, 183L, 175L), weight_alog1 = c(70L, 69L, 80L, 74L, 103L, 76L, 50L, 61L, 52L, 65L, 66L, 58L, 55L, 55L, 70L, 81L, 77L, 76L, 71L, 64L, 71L, 95L, 106L, 100L, 85L, 79L, 69L, 84L, 67L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29"))
# Apply each summary statistic column-wise over df, then bind the six
# named results into one data.frame (variables as rows, statistics as
# columns; transpose afterwards for the requested layout).
summary_funs <- list(
  mean = mean,
  sd = sd,
  median = median,
  min = min,
  max = max,
  n = length
)
tmp <- do.call(data.frame, lapply(summary_funs, function(f) apply(df, 2, f)))
tmp
mean sd median min max n
age 20.41379 3.300619 19 17 31 29
height_seca1 1737.24138 91.919474 1755 1569 1877 29
height_chad1 1736.48276 92.682492 1755 1570 1880 29
height_DL 173.37931 9.685828 175 155 188 29
weight_alog1 73.41379 14.541854 71 50 106 29
or...
data.frame(t(tmp))
age height_seca1 height_chad1 height_DL weight_alog1
mean 20.413793 1737.24138 1736.48276 173.379310 73.41379
sd 3.300619 91.91947 92.68249 9.685828 14.54185
median 19.000000 1755.00000 1755.00000 175.000000 71.00000
min 17.000000 1569.00000 1570.00000 155.000000 50.00000
max 31.000000 1877.00000 1880.00000 188.000000 106.00000
n 29.000000 29.00000 29.00000 29.000000 29.00000
You can use lapply to go over each column and an anonymous function to do each of your calculations:
# lapply() visits each column of mydf; rbind() stacks the six scalar
# summaries into a named one-column matrix per variable, and data.frame()
# binds those side by side (columns = variables, rows = statistics).
res <- lapply( mydf , function(x) rbind( mean = mean(x) ,
sd = sd(x) ,
median = median(x) ,
minimum = min(x) ,
maximum = max(x) ,
s.size = length(x) ) )
data.frame( res )
# age height_seca1 height_chad1 height_DL weight_alog1
#mean 20.413793 1737.24138 1736.48276 173.379310 73.41379
#sd 3.300619 91.91947 92.68249 9.685828 14.54185
#median 19.000000 1755.00000 1755.00000 175.000000 71.00000
#minimum 17.000000 1569.00000 1570.00000 155.000000 50.00000
#maximum 31.000000 1877.00000 1880.00000 188.000000 106.00000
#s.size 29.000000 29.00000 29.00000 29.000000 29.00000
Adding few more options for quick Exploratory Data Analysis (EDA)
1) skimr package:
# skim() prints a compact per-column summary (missing counts, mean, sd,
# quantiles, inline histogram).
install.packages("skimr")
library(skimr)
skim(df)
2) ExPanDaR package:
# ExPanD() launches an interactive exploratory-data-analysis UI for df.
install.packages("ExPanDaR")
library(ExPanDaR)
# export data and code to a notebook
ExPanD(df, export_nb_option = TRUE)
# open a shiny app
ExPanD(df)
3) DescTools package:
# Desc() prints a detailed textual description of every column and, with
# plotit = TRUE, draws the matching diagnostic plots.
install.packages("DescTools")
library(DescTools)
Desc(df, plotit = TRUE)
#> ------------------------------------------------------------------------------
#> Describe df (data.frame):
#>
#> data frame: 29 obs. of 5 variables
#> 29 complete cases (100.0%)
#>
#> Nr ColName Class NAs Levels
#> 1 age integer .
#> 2 height_seca1 integer .
#> 3 height_chad1 integer .
#> 4 height_DL integer .
#> 5 weight_alog1 integer .
#>
#>
#> ------------------------------------------------------------------------------
#> 1 - age (integer)
#>
#> length n NAs unique 0s mean meanCI
#> 29 29 0 9 0 20.41 19.16
#> 100.0% 0.0% 0.0% 21.67
#>
#> .05 .10 .25 median .75 .90 .95
#> 18.00 18.00 19.00 19.00 21.00 26.00 27.20
#>
#> range sd vcoef mad IQR skew kurt
#> 14.00 3.30 0.16 1.48 2.00 1.75 2.29
#>
#>
#> level freq perc cumfreq cumperc
#> 1 17 1 3.4% 1 3.4%
#> 2 18 6 20.7% 7 24.1%
#> 3 19 11 37.9% 18 62.1%
#> 4 20 1 3.4% 19 65.5%
#> 5 21 5 17.2% 24 82.8%
#> 6 22 1 3.4% 25 86.2%
#> 7 26 2 6.9% 27 93.1%
#> 8 28 1 3.4% 28 96.6%
#> 9 31 1 3.4% 29 100.0%
#>
#> heap(?): remarkable frequency (37.9%) for the mode(s) (= 19)
Results from Desc can be saved to a Microsoft Word docx file
### RDCOMClient package is needed
# NOTE(review): RDCOMClient drives Word via COM, so this export path
# presumably works on Windows only -- confirm before recommending.
install.packages("RDCOMClient", repos = "http://www.omegahat.net/R")
# or
devtools::install_github("omegahat/RDCOMClient")
# create a new word instance and insert title and contents
wrd <- GetNewWrd(header = TRUE)
DescTools::Desc(df, plotit = TRUE, wrd = wrd)
Created on 2020-01-17 by the reprex package (v0.3.0)
So far I had the same problem and I wrote ...
# Apply every (named) function in flist to each column of x and return a
# data.frame with one row per statistic and one column per variable.
# Each function in flist must return a single numeric value.
h <- function(x, flist){
  summarise_col <- function(col) {
    vapply(flist, function(stat_fun) stat_fun(col), FUN.VALUE = numeric(1))
  }
  out <- as.data.frame(lapply(x, summarise_col))
  row.names(out) <- names(flist)
  out
}
h(cars, flist = list(mean = mean, median = median, std_dev = sd))
it should work with any function specified in flist, as long as the function returns a single value; i.e. it won't work with range
Note that elements of flist should be named otherwise, you'll get strange row.names for the resulting data.frame
I want to compare protein expression values (n=465 proteins) for two groups of patients (resistant vs. sensitive).
I have 11 resistant patients and 8 sensitive patients. I would like to compare (ttest) expression values of protein 1 of the resistant group (A res to K res) with that of the sensitive group (L sens to S sens), protein 2 (resistant) with protein 2 (sensitive), and so on. As an output I want only the proteins where the p-value is <0.05.
I tried to do this (see below), but there is something wrong and I can not figure out what.
X Protein.1 Protein.2 Protein.3 Protein.4 Protein.5 Protein.6
1 A res 4127 16886 1785 1636 407 135
2 B res 10039 32414 3144 1543 601 154
3 C res 527 1059 1637 317 229 107
4 D res 553 3848 7357 1168 1549 441
5 E res 2351 2272 5868 2606 517 159
6 F res 822 1767 2110 818 293 75
7 G res 673 1887 511 471 214 NA
8 H res 5769 2206 2041 517 355 298
9 I res 1660 4221 1921 629 383 104
10 J res 3281 1804 2400 225 268 52
11 K res 3383 1882 1935 185 NA NA
12 L sens 10810 20136 2350 1143 527 160
13 M sens 5941 14873 3550 943 308 NA
14 N sens 1100 2325 1359 561 542 284
15 O sens 85 587 619 364 85 52
16 P sens 2321 6335 6494 994 NA NA
17 Q sens 103810 7102 7986 1464 439 187
18 R sens 1174 2076 1423 340 186 70
19 S sens 1829 973 1343 380 453 221
data <- read.csv("ProteinDataResSens.csv", sep=";", na.strings="weak", header=TRUE)
# NOTE(review): this approach transposes the problem. combn() pairs the
# renamed *column labels* (res_1..., sens_1...), so each t.test compares two
# protein columns with each other instead of res vs. sens rows within one
# protein -- and those renamed labels exist only on the `res`/`sens` copies,
# not on `data`, so data[, x[1]] does not select what was intended.
res <- data.frame(data[1:11, ], row.names=NULL)
colnames(res) <- paste("res", 1:length(res), sep="_")
sens <- data.frame(data[12:19, ], row.names=NULL)
colnames(sens) <- paste("sens", 1:length(sens), sep="_")
com <- combn(c(colnames(res), colnames(sens)), 2)
p <- apply(com, 2, function(x) t.test(data[, x[1]], data[, x[2]])$p.val)
data.frame(comparison=paste(com[1, ], com[2, ],sep=" vs."), p.value=p)
Thank you very much for any help!
If you want to compare the res against sens for each Protein columns
# Split rows into res / sens by the suffix of X, run one two-sample t.test
# per protein column, and keep the significant proteins.
grp <- sub(".* ", "", df$X)
# The original anonymous function declared a second parameter `y` but never
# used it, silently capturing the global `grp` instead; bind the argument
# that list(grp) actually supplies.
Pvals <- mapply(function(x, g) t.test(x[g == "res"],
                                      x[g == "sens"])$p.value,
                df[, -1], list(grp))
Pvals[Pvals < 0.05]
Or using data.table
library(data.table)
# setDT() converts df to a data.table by reference; the first [ adds a grp
# column (res/sens suffix of X), the second runs the res-vs-sens t.test over
# every protein column (.SDcols excludes X and the freshly added grp column).
setDT(df)[, grp:= sub('.* ', "", X)][, lapply(.SD,
function(x) t.test(x[grp=='res'], x[grp=='sens'])$p.value),
.SDcols=2:(ncol(df)-1)]
data
# df: protein expression per patient; X encodes the patient id plus the
# group suffix ("res" = resistant rows 1-11, "sens" = sensitive rows 12-19).
# Protein.5 and Protein.6 contain NAs (read in via na.strings = "weak").
df <- structure(list(X = c("A res", "B res", "C res", "D res", "E res",
"F res", "G res", "H res", "I res", "J res", "K res", "L sens",
"M sens", "N sens", "O sens", "P sens", "Q sens", "R sens", "S sens"
), Protein.1 = c(4127L, 10039L, 527L, 553L, 2351L, 822L, 673L,
5769L, 1660L, 3281L, 3383L, 10810L, 5941L, 1100L, 85L, 2321L,
103810L, 1174L, 1829L), Protein.2 = c(16886L, 32414L, 1059L,
3848L, 2272L, 1767L, 1887L, 2206L, 4221L, 1804L, 1882L, 20136L,
14873L, 2325L, 587L, 6335L, 7102L, 2076L, 973L), Protein.3 = c(1785L,
3144L, 1637L, 7357L, 5868L, 2110L, 511L, 2041L, 1921L, 2400L,
1935L, 2350L, 3550L, 1359L, 619L, 6494L, 7986L, 1423L, 1343L),
Protein.4 = c(1636L, 1543L, 317L, 1168L, 2606L, 818L, 471L,
517L, 629L, 225L, 185L, 1143L, 943L, 561L, 364L, 994L, 1464L,
340L, 380L), Protein.5 = c(407L, 601L, 229L, 1549L, 517L,
293L, 214L, 355L, 383L, 268L, NA, 527L, 308L, 542L, 85L,
NA, 439L, 186L, 453L), Protein.6 = c(135L, 154L, 107L, 441L,
159L, 75L, NA, 298L, 104L, 52L, NA, 160L, NA, 284L, 52L,
NA, 187L, 70L, 221L)), .Names = c("X", "Protein.1", "Protein.2",
"Protein.3", "Protein.4", "Protein.5", "Protein.6"), class =
"data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19"))