cast function gives "arguments imply differing number of rows: 9, 0" - r

I am trying to find the mean of each variable within each group, but I don't know why I get an error after applying the cast function.
library(reshape)
> odata <- read.csv("dummy2.csv")
> msdata <- melt(odata, id=c("A","F"))
> subjmeans <- cast(msdata, A~ variable, mean)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 9, 0
Here is the data I have used.
Timestamp A B C D E F G H I J
2586 01_Antwerpen_S1.jpg 9 250 1151 458 p1 color 261.8472837 13.27605282 50.20731621
2836 01_Antwerpen_S1.jpg 10 150 1371 316 p1 color 41.01219331 2.088502575 25.59470566
2986 01_Antwerpen_S1.jpg 11 283 1342 287 p1 color 580.2206477 28.92031693 84.62469724
3269 01_Antwerpen_S1.jpg 12 433 762 303 p1 color 138.1303732 7.026104125 36.45742907
3702 01_Antwerpen_S1.jpg 13 183 624 297 p1 color 88.20430828 4.489909458 30.87780081
3885 01_Antwerpen_S1.jpg 14 333 712 303 p1 color 42.20189569 2.149072905 25.72796039
4218 01_Antwerpen_S1.jpg 15 300 753 293 p1 color 51.7880295 2.637077062 26.80156954
6517 01_Antwerpen_S1.jpg 22 333 601 674 p1 color 466.0525721 23.40488212 72.49074066
9066 02_Berlin_S1.jpg 27 149 1067 681 p1 color 90.42676595 4.602920212 31.12642447
9215 02_Berlin_S1.jpg 28 266 1116 757 p1 color 101.8430165 5.18328435 32.40322557
9481 02_Berlin_S1.jpg 29 217 1020 723 p1 color 314.3962468 15.90906187 55.99993612
9698 02_Berlin_S1.jpg 30 183 711 781 p1 color 272.045952 13.78825606 51.33416332
9881 02_Berlin_S1.jpg 31 183 439 776 p1 color 249.9939999 12.68008164 48.8961796
10064 02_Berlin_S1.jpg 32 167 328 552 p1 color 193.8375609 9.847751174 42.66505258
10231 02_Berlin_S1.jpg 33 400 310 359 p1 color 68.00735254 3.462531847 28.61757006
10631 02_Berlin_S1.jpg 34 666 246 336 p1 color 93.40770846 4.754485399 31.45986788
11297 02_Berlin_S1.jpg 35 333 172 279 p1 color 1105.224412 52.32154317 136.107395
13679 03_Bordeaux_S1.jpg 40 316 1152 790 p1 color 280.8629559 14.23062355 52.30737182
13995 03_Bordeaux_S1.jpg 41 583 1424 860 p1 color 134.1827113 6.825784964 36.01672692
14578 03_Bordeaux_S1.jpg 42 283 1486 979 p1 color 133.9589489 6.814429158 35.99174415
14861 03_Bordeaux_S1.jpg 43 233 1419 863 p1 color 282.1772493 14.29652823 52.4523621
15094 03_Bordeaux_S1.jpg 44 266 1149 781 p1 color 998.5128943 47.86171758 126.2957787
17559 04_Köln_S1.jpg 49 200 151 813 p1 color 590.041524 29.38880547 85.65537204
17759 04_Köln_S1.jpg 50 183 741 806 p1 color 294.9779653 14.93791111 53.86340444
17943 04_Köln_S1.jpg 51 216 1035 782 p1 color 81.0246876 4.124771083 30.07449638
18159 04_Köln_S1.jpg 52 117 1068 708 p1 color 85.80209788 4.367748556 30.60904682
I get the same error with the iris data too.
library(reshape)
ss <- iris
msdata <- melt(ss, id=c("Sepal.Length","Species"))
subjmeans <- cast(msdata, Species~ variable, mean)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 9, 0
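A possible workaround, sketched with base R only: the same per-group means can be computed without cast by calling aggregate on the measured columns. The name measure_cols is introduced here for illustration. Another route that should work is reshape2::dcast(msdata, Species ~ variable, mean), assuming the reshape2 package is installed.

```r
ss <- iris

# Columns to average: everything except the two id columns used in melt()
measure_cols <- setdiff(names(ss), c("Sepal.Length", "Species"))

# One row per Species, one column of means per measured variable
subjmeans <- aggregate(ss[measure_cols],
                       by = list(Species = ss$Species),
                       FUN = mean)
subjmeans
```

This gives the wide layout that cast(msdata, Species ~ variable, mean) was meant to produce, without going through the melted data.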

Gompertz-Makeham parameter estimation

I would like to estimate the parameters of the Gompertz-Makeham distribution, but I haven't managed to get a result.
I would like a method in R, like this Weibull parameter estimation code:
weibull_loglik <- function(parm){
  gamma <- parm[1]   # shape
  lambda <- parm[2]  # scale
  loglik <- sum(dweibull(vec, shape=gamma, scale=lambda, log=TRUE))
  return(-loglik)    # nlm minimizes, so return the negative log-likelihood
}
weibull <- nlm(weibull_loglik, p = c(1,1), hessian = TRUE, iterlim=100)
weibull$estimate
c <- weibull$estimate[1]; b <- weibull$estimate[2]
My data:
[1] 872 52 31 26 22 17 11 17 17 8 20 12 25 14 17
[16] 20 17 23 32 37 28 24 43 40 34 29 26 32 34 51
[31] 50 67 84 70 71 137 123 137 172 189 212 251 248 272 314
[46] 374 345 411 494 461 505 506 565 590 535 639 710 733 795 786
[61] 894 963 1019 1149 1185 1356 1354 1460 1622 1783 1843 2049 2262 2316 2591
[76] 2730 2972 3187 3432 3438 3959 3140 3612 3820 3478 4054 3587 3433 3150 2881
[91] 2639 2250 1850 1546 1236 966 729 532 375 256 168 107 65 39 22
[106] 12 6 3 2 1 1
summary(vec)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 32.0 314.0 900.9 1355.0 4054.0
It would be nice to have a reproducible example, but something like:
library(bbmle)
library(eha)
set.seed(101)
vec <- rmakeham(1000, shape = c(2,3), scale = 2)
dmwrap <- function(x, shape1, shape2, scale, log) {
res <- try(dmakeham(x, c(shape1, shape2), scale, log = log), silent = TRUE)
if (inherits(res, "try-error")) return(NA)
res
}
m1 <- mle2(y ~ dmwrap(shape1, shape2, scale),
start = list(shape1=1,shape2=1, scale=1),
data = data.frame(y = vec),
method = "Nelder-Mead"
)
Define a wrapper that (1) takes the shape parameters as separate values and (2) returns NA rather than throwing an error when, e.g., parameters are negative.
Use Nelder-Mead rather than the default BFGS for robustness.
The fitdistrplus package might help too.
If you're going to do a lot of this, it may help to fit parameters on the log scale (i.e. use parameters logshape1, etc., and use exp(logshape1) etc. in the fitting formula).
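The log-scale idea can be sketched with the Weibull likelihood from the question, using base R only. The simulated data and the names weibull_nll_log and logparm are assumptions made for illustration; the same trick should carry over to the Makeham wrapper by passing exp(logshape1) etc. to it.

```r
set.seed(101)
vec <- rweibull(1000, shape = 2, scale = 3)  # simulated stand-in data

# Negative log-likelihood with both parameters on the log scale, so the
# optimizer can never propose a negative shape or scale
weibull_nll_log <- function(logparm) {
  -sum(dweibull(vec, shape = exp(logparm[1]), scale = exp(logparm[2]),
                log = TRUE))
}

fit <- optim(c(0, 0), weibull_nll_log, method = "Nelder-Mead")
est <- exp(fit$par)  # back-transform to the natural scale
est
```

Because the optimizer works on an unconstrained scale, no NA-returning guard is needed for invalid parameter values in this simple case.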
I had to work a little harder to fit your data; I divided the variable by 1000 (and found that I could only compute the log-likelihood; the likelihood gave an error that I didn't bother trying to track down). Unfortunately, it doesn't look like a great fit (too many small values).
x <- scan(text = "872 52 31 26 22 17 11 17 17 8 20 12 25 14 17
20 17 23 32 37 28 24 43 40 34 29 26 32 34 51
50 67 84 70 71 137 123 137 172 189 212 251 248 272 314
374 345 411 494 461 505 506 565 590 535 639 710 733 795 786
894 963 1019 1149 1185 1356 1354 1460 1622 1783 1843 2049 2262 2316 2591
2730 2972 3187 3432 3438 3959 3140 3612 3820 3478 4054 3587 3433 3150 2881
2639 2250 1850 1546 1236 966 729 532 375 256 168 107 65 39 22
12 6 3 2 1 1")
m1 <- mle2(y ~ dmwrap(shape1, shape2, scale),
start = list(shape1=1,shape2=1, scale=10000),
data = data.frame(y = x/1000),
method = "Nelder-Mead"
)
cc <- as.list(coef(m1))
png("gm.png")
hist(x,breaks = 25, freq=FALSE)
with(cc,
curve(exp(dmwrap(x/1000, shape1, shape2, scale, log = TRUE))/1000, add = TRUE)
)
dev.off()

Use t.test in a data.frame using categories and specific values within

I have the following data.frame (please ignore the row numbers):
row country measurement sampleNr Temperature
46 Germany P 379 28.800
47 Germany P 380 28.950
48 Germany P 381 28.850
139 Control P 181 28.265
140 Control P 182 28.205
141 Control P 183 28.095
142 Control P 382 28.440
143 Control P 383 28.090
144 Control P 384 28.265
190 France P 376 28.965
191 France P 377 29.000
192 France P 378 29.030
238 USA P 190 29.675
239 USA P 191 29.170
240 USA P 192 28.725
286 Cyprus P 373 29.750
287 Cyprus P 374 29.715
288 Cyprus P 375 30.295
334 Malta P 184 28.430
335 Malta P 185 28.140
336 Malta P 186 28.575
382 Japan P 187 29.220
383 Japan P 188 29.490
384 Japan P 189 29.240
46 Germany P 379 28.800
47 Germany P 380 28.950
48 Germany P 381 28.850
139 Control M 181 28.265
140 Control M 182 28.205
141 Control M 183 28.095
142 Control M 382 28.440
143 Control M 383 28.090
144 Control M 384 28.265
190 France M 376 28.965
191 France M 377 29.000
192 France M 378 29.030
238 USA M 190 29.675
239 USA M 191 29.170
240 USA M 192 28.725
286 Cyprus M 373 29.750
287 Cyprus M 374 29.715
288 Cyprus M 375 30.295
334 Malta M 184 28.430
335 Malta M 185 28.140
336 Malta M 186 28.575
382 Japan M 187 29.220
383 Japan M 188 29.490
384 Japan M 189 29.240
I would like to perform a t.test of the Control vs. every other country, per measurement. Is there a way to do this using the formula interface of t.test? I think it is not possible; is there another efficient way to do this?
At the moment I am using nested for loops in combination with the which() function to iterate over the groups of measurements and countries, then get the values (mostly three, six for the control) and put these in a t.test. But this is very inefficient.
We can try
library(data.table)
dfN <- subset(df, country == "Control")   # control rows
dfN1 <- subset(df, country != "Control")  # all other countries
rbindlist(Map(function(x, y) as.data.table(x)[, .(pval = t.test(Temperature, y$Temperature)$p.value) , country],
split(dfN1, dfN1$measurement), split(dfN, dfN$measurement)),
idcol = 'measurement')
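For comparison, here is a minimal base R sketch that does use the formula interface of t.test, by restricting the data to Control plus one country at a time. It uses a subset of the question's P rows as stand-in data; the name combos is introduced for illustration.

```r
# Subset of the question's data (P measurement only), as a stand-in
df <- data.frame(
  country = rep(c("Control", "Germany", "France"), times = c(6, 3, 3)),
  measurement = "P",
  Temperature = c(28.265, 28.205, 28.095, 28.440, 28.090, 28.265,
                  28.800, 28.950, 28.850,
                  28.965, 29.000, 29.030),
  stringsAsFactors = FALSE
)

# All (measurement, country) pairs except the control rows
combos <- unique(df[df$country != "Control", c("measurement", "country")])

# Formula interface: keep Control plus one country, so the grouping
# variable has exactly two levels
combos$pval <- mapply(function(m, cn) {
  d <- df[df$measurement == m & df$country %in% c("Control", cn), ]
  t.test(Temperature ~ country, data = d)$p.value
}, combos$measurement, combos$country)
combos
```

The formula interface needs exactly two groups, which is why each call sees only Control and one other country.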

Divide paired matching columns

I have a data.frame df with matching columns that are also paired. The matching columns are defined in the list patient. I would like to divide the matching columns by each other. Any suggestions on how to do this?
I tried this, but it does not take the pairing from patient into account.
m1 <- df[, sort(colnames(df))]
m1_g <- m1[, grep("^n", colnames(m1))]
m1_r <- m1[, grep("^t", colnames(m1))]
m1_new <- m1_g/m1_r
m1_new
head(df)
na-008 ta-008 nc012 tb012 na020 na-018 ta-018 na020 tc020 tc093 nc093
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGT 56 311 137 242 23 96 113 106 41 114
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGG 208 656 350 713 49 476 183 246 157 306
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGT 631 1978 1531 2470 216 1906 732 850 665 909
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTT 2760 8159 6067 9367 622 4228 2931 3031 2895 2974
hsa-let-7b-5p_TGAGGTAGTAGGTTGTGTGGTTT 1698 4105 3737 3729 219 1510 1697 1643 1527 1536
> head(patient)
$`008`
[1] "na-008" "ta-008"
$`012`
[1] "nc012" "tb012"
$`018`
[1] "na-018" "ta-018"
$`020`
[1] "na020" "tc020"
$`045`
[1] "nb045" "tc045"
$`080`
[1] "nb-080" "ta-080"
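One possible approach, sketched under the assumption that each element of patient holds the "n" column name first and the "t" column name second: loop over the list and divide the two named columns. The toy df below uses made-up values for two patients, and the names present and ratios are introduced for illustration.

```r
# Toy stand-in for df: two patients, two rows (values are illustrative)
df <- data.frame(
  "na-008" = c(56, 208), "ta-008" = c(311, 656),
  "nc012"  = c(137, 350), "tb012"  = c(242, 713),
  check.names = FALSE
)
patient <- list("008" = c("na-008", "ta-008"),
                "012" = c("nc012", "tb012"),
                "045" = c("nb045", "tc045"))  # columns absent from the toy df

# Keep only the pairs whose two columns both exist in df
present <- Filter(function(p) all(p %in% colnames(df)), patient)

# One ratio column per patient: first element divided by second
ratios <- sapply(present, function(p) df[[p[1]]] / df[[p[2]]])
rownames(ratios) <- rownames(df)
ratios
```

Because the pairing comes from the list rather than from column order, it does not matter how the columns of df are sorted.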

generalizing net/gross in a bar chart

I'm doing a particular operation quite a bit, and I need help generalizing it.
I have a lot of data that "looks" kind of like this:
> hflights::hflights %>% tbl_df %>% mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>% summarize(delay=sum(ArrDelay, na.rm=T)) %>%
dcast(month ~ carrier)
month AA AS B6 CO DL EV F9 FL MQ OO UA US WN XE YV
1 1 18 296 229 27031 1026 1337 851 216 2322 3957 -219 -1068 31701 24248 NA
2 2 461 249 802 15769 1657 730 707 1079 4283 11486 323 -663 36729 27861 -44
3 3 317 476 1037 49061 905 2529 673 1111 2524 12955 1665 -606 28758 50702 -38
4 4 1147 465 518 52086 1856 4483 515 927 5085 17439 1803 -711 47084 69590 260
5 5 1272 56 654 63413 1381 3563 1334 1213 7899 22190 1798 1627 73771 66972 18
6 6 -262 172 504 60042 3736 2618 744 983 4519 21652 6260 2140 40191 66456 49
7 7 -460 112 1241 41300 2868 1628 321 506 1529 23432 2780 497 21200 98484 34
8 8 -1417 59 1659 36106 -949 808 42 -1366 310 11038 3546 -84 6991 33554 34
9 9 -841 -364 -202 24857 1022 -424 151 -747 -1373 4502 1743 248 15592 31846 NA
10 10 215 -112 -45 26437 1082 -1005 277 -537 522 13 1833 -1878 14725 27539 NA
11 11 97 -5 -72 20339 -101 207 180 449 2286 2628 230 -1093 8424 24199 NA
12 12 2287 -242 310 6644 1281 -1082 585 79 2311 5900 -491 -951 12735 65269 NA
There are positive and negative values with some groups; in this case, month & carrier. I can plot it like this:
> hflights::hflights %>% tbl_df %>% mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>% summarize(delay=mean(ArrDelay, na.rm=T)) %>%
ggplot(aes(x=month, y=delay, fill=carrier)) + geom_bar(stat='identity')
Which gives me an eye-bleedy chart like this:
It also gives me the message:
Warning message:
Stacking not well defined when ymin != 0
This message is kind of what I'm after. I want to separate positive from negative so that I can see the "gross" amount, and also generate the sum per group and show the "net" amount.
For this dataset, I can do that like so:
> df <- hflights::hflights %>% tbl_df %>%
mutate(month=Month, carrier=UniqueCarrier) %>%
group_by(month, carrier) %>% summarize(delay=mean(ArrDelay, na.rm=T))
> ggplot(NULL, aes(x=month, y=delay, fill=carrier)) +
geom_bar(data=df %>% filter(delay > 0), stat='identity') +
geom_bar(data=df %>% filter(delay < 0), stat='identity') +
geom_bar(data=df %>% group_by(month) %>% summarize(delay=sum(delay, na.rm=T)), fill='black', width=0.25, alpha=0.5, stat='identity')
Which gives me this chestnut:
This is much nicer because in September, it doesn't do netting so I get a better sense of the magnitude of the positives and the negatives.
However, the above only works for this dataset. What happens when I have different groups? How do I generalize this?
Adding position = "identity" to geom_bar should get rid of the warning you are getting in your first plot.
The warning is related to the negative values being interpreted as bars of negative height, which stacking does not handle well, rather than as plain negative values.
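As for generalizing the positive/negative/net split, one option is a small helper that takes the column names as strings. This is only a sketch: net_gross, xcol, ycol and fillcol are hypothetical names, the toy data stands in for the month/carrier/delay summary, and the plotting part runs only if ggplot2 is installed.

```r
# Hypothetical helper (not from the thread): split a value column by sign and
# compute the net total per x group. `x` and `y` are column names as strings.
net_gross <- function(d, x, y) {
  ok <- !is.na(d[[y]])
  list(
    pos = d[ok & d[[y]] > 0, , drop = FALSE],                   # gross positive part
    neg = d[ok & d[[y]] < 0, , drop = FALSE],                   # gross negative part
    net = aggregate(d[y], by = d[x], FUN = sum, na.rm = TRUE)   # net per group
  )
}

# Toy data standing in for the month/carrier/delay summary
toy <- data.frame(
  month   = c(1, 1, 2, 2),
  carrier = c("AA", "AS", "AA", "AS"),
  delay   = c(18, -5, -3, 10)
)
parts <- net_gross(toy, x = "month", y = "delay")

# Build the plot only if ggplot2 is available; .data[[...]] lets the
# column names vary between datasets
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  xcol <- "month"; ycol <- "delay"; fillcol <- "carrier"
  p <- ggplot(mapping = aes(x = .data[[xcol]], y = .data[[ycol]])) +
    geom_bar(aes(fill = .data[[fillcol]]), data = parts$pos, stat = "identity") +
    geom_bar(aes(fill = .data[[fillcol]]), data = parts$neg, stat = "identity") +
    geom_bar(data = parts$net, fill = "black", width = 0.25, alpha = 0.5,
             stat = "identity")
  # print(p) to draw the chart
}
```

For a different dataset, only the three column names and the data frame passed to net_gross would need to change.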

marking a point in the boxplot

I am plotting three different sets as three boxplots on one page using ggplot2. In each set there is a point that I would like to highlight, to show where it stands compared to the others: is it inside the box or outside?
Here is my data:
CDH 1KG NHLBI
CDH 301 688 1762
RS0 204 560 21742
RS1 158 1169 1406
RS2 182 1945 1467
RS3 256 2371 1631
RS4 198 580 1765
RS5 193 524 1429
RS6 139 2551 1469
RS7 188 702 1584
RS8 142 4311 1461
RS9 223 916 1591
RS10 250 794 1406
RS11 185 539 1270
RS12 228 641 1786
RS13 152 557 1677
RS14 225 1970 1619
RS15 196 458 1543
RS16 203 2891 1528
RS17 221 1542 1780
RS18 258 1173 1850
RS19 202 718 1651
RS20 191 6314 1564
library(ggplot2)
rm(list = ls())
orig_table = read.table("thedata.csv", header = T, sep = ",")
bb = orig_table # have copy of the data
bb = bb[,-1] # the points in the first row are the ones of interest, so I exclude them from the sets for the time being
tt = bb
mydata = cbind(c(tt[,1], tt[,2], tt[,3]), c(rep(1,22),rep(2,22),rep(3,22))) # I form the dataframe
data2 = cbind(c(301,688,1762),c(1,2,3)) # these are the points that I want to highlight (the values in the first row)
colnames(data2) = c("num","gro")
data2 = as.data.frame(data2) # I form them as a dataframe
colnames(mydata) = c("num","gro")
mydata = as.data.frame(mydata)
mydata$gro = factor(mydata$gro, levels=c(1,2,3))
qplot(gro, num, data=mydata, geom=c("boxplot"))+scale_y_log10() # I am making the boxplots out of the 21 other points
# and here I want to highlight those three values in the "data2" dataframe
I appreciate your help
First, ggplot is a lot easier to use if you use data in long format. melt from reshape2 helps with that:
library(reshape2)
library(ggplot2)
df$highlight <- c(TRUE, rep(FALSE, nrow(df) - 1L)) # tag first row as interesting
df.2 <- melt(df) # convert df to long format
ggplot(subset(df.2, !highlight), aes(x=variable, y=value)) +
geom_boxplot() + scale_y_log10() +
geom_point( # add the highlight points
data=subset(df.2, highlight),
aes(x=variable, y=value),
color="red", size=5
)
All I did was add a highlight column that is TRUE for the first row, melt the data to be compatible with ggplot, and plot the points with highlight==TRUE in addition to the boxplots.
EDIT: this is how I made the data:
df <- read.table(text=" CDH 1KG NHLBI
CDH 301 688 1762
RS0 204 560 21742
RS1 158 1169 1406
RS2 182 1945 1467
RS3 256 2371 1631
RS4 198 580 1765
RS5 193 524 1429
RS6 139 2551 1469
RS7 188 702 1584
RS8 142 4311 1461
RS9 223 916 1591
RS10 250 794 1406
RS11 185 539 1270
RS12 228 641 1786
RS13 152 557 1677
RS14 225 1970 1619
RS15 196 458 1543
RS16 203 2891 1528
RS17 221 1542 1780
RS18 258 1173 1850
RS19 202 718 1651
RS20 191 6314 1564", header=T)
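To answer "inside the box or outside" numerically as well, here is a minimal base R sketch that compares each highlighted value with the quartiles of the remaining rows. It uses only the first eight rows of the data as a stand-in and works on the raw scale, so the result may differ slightly from boxes drawn on a log axis; the names highlight, others and inside_box are introduced for illustration.

```r
# First eight rows of the data above, as a stand-in
df <- read.table(text = " CDH 1KG NHLBI
CDH 301 688 1762
RS0 204 560 21742
RS1 158 1169 1406
RS2 182 1945 1467
RS3 256 2371 1631
RS4 198 580 1765
RS5 193 524 1429
RS6 139 2551 1469", header = TRUE)

highlight <- df[1, ]   # the row of interest
others    <- df[-1, ]  # the rows that form the boxes

# TRUE when the highlighted value lies between the 25% and 75% quantiles
inside_box <- sapply(names(df), function(v) {
  q <- quantile(others[[v]], c(0.25, 0.75))
  highlight[[v]] >= q[1] && highlight[[v]] <= q[2]
})
inside_box
```

Note that read.table renames the 1KG column to X1KG, since R column names cannot start with a digit by default.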
