I'm new to R, and my question might seem easy to most of you. I have data like this:
> data.frame(table(dat),total)
AGEintervals mytest.G_B_FLAG Freq total
1 (1,23] 0 5718 5912
2 (23,26] 0 5249 5579
3 (26,28] 0 3105 3314
4 (28,33] 0 6277 6693
5 (33,37] 0 4443 4682
6 (37,41] 0 4277 4514
7 (41,46] 0 4904 5169
8 (46,51] 0 4582 4812
9 (51,57] 0 4039 4236
10 (57,76] 0 3926 4031
11 (1,23] 1 194 5912
12 (23,26] 1 330 5579
13 (26,28] 1 209 3314
14 (28,33] 1 416 6693
15 (33,37] 1 239 4682
16 (37,41] 1 237 4514
17 (41,46] 1 265 5169
18 (46,51] 1 230 4812
19 (51,57] 1 197 4236
20 (57,76] 1 105 4031
As you might have noticed, the age intervals start repeating at row 11.
All I need is 10 rows, with the 0's and 1's in separate columns, like this:
AGEintervals 1 0 total
1 (1,23] 194 5718 5912
2 (23,26] 330 5249 5579
3 (26,28] 209 3105 3314
4 (28,33] 416 6277 6693
5 (33,37] 239 4443 4682
6 (37,41] 237 4277 4514
7 (41,46] 265 4904 5169
8 (46,51] 230 4582 4812
9 (51,57] 197 4039 4236
10 (57,76] 105 3926 4031
Many thanks
This is a straightforward "long" to "wide" transformation that is easy to achieve with reshape from base R:
reshape(mydf, idvar = c("AGEintervals", "total"),
timevar = "mytest.G_B_FLAG", direction = "wide")
# AGEintervals total Freq.0 Freq.1
# 1 (1,23] 5912 5718 194
# 2 (23,26] 5579 5249 330
# 3 (26,28] 3314 3105 209
# 4 (28,33] 6693 6277 416
# 5 (33,37] 4682 4443 239
# 6 (37,41] 4514 4277 237
# 7 (41,46] 5169 4904 265
# 8 (46,51] 4812 4582 230
# 9 (51,57] 4236 4039 197
# 10 (57,76] 4031 3926 105
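For a copy-paste check, here is a minimal sketch that rebuilds only the first three age intervals from the question as mydf (the subset is an assumption made for brevity), reshapes it, and then renames and reorders the columns to match the requested layout:

```r
# First three age intervals from the question, in "long" form
mydf <- data.frame(
  AGEintervals    = rep(c("(1,23]", "(23,26]", "(26,28]"), times = 2),
  mytest.G_B_FLAG = rep(c(0, 1), each = 3),
  Freq            = c(5718, 5249, 3105, 194, 330, 209),
  total           = rep(c(5912, 5579, 3314), times = 2)
)

# One row per (AGEintervals, total); one Freq column per flag value
wide <- reshape(mydf, idvar = c("AGEintervals", "total"),
                timevar = "mytest.G_B_FLAG", direction = "wide")

# Drop the "Freq." prefix and put the columns in the requested order
names(wide) <- sub("^Freq\\.", "", names(wide))
wide <- wide[, c("AGEintervals", "1", "0", "total")]
wide
```

Because the new columns are named "1" and "0", they have to be accessed with quotes or backticks, e.g. wide[["1"]].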
Other alternatives include:
reshape2
library(reshape2)
dcast(mydf, ... ~ mytest.G_B_FLAG, value.var='Freq')
tidyr
library(tidyr)
spread(mydf, mytest.G_B_FLAG, Freq)
Update
This problem is possibly avoidable in the first place.
Run the following example code and compare the output at each stage:
## Create some sample data
set.seed(1)
dat <- data.frame(V1 = sample(letters[1:3], 20, TRUE),
V2 = sample(c(0, 1), 20, TRUE))
## View the output
dat
## Look what happens when we use `data.frame` on a `table`
data.frame(table(dat))
## Compare it with `as.data.frame.matrix`
as.data.frame.matrix(table(dat))
## The total can be added automatically with `addmargins`
as.data.frame.matrix(addmargins(table(dat), 2, sum))
I have a dataframe like this:
> df
# 1 2 3 4 5 6 7 8 9 10
# ENSG00000000003 2407 2345 1052 2191 2542 812 3595 4215 1100 5457
# ENSG00000000005 0 5 0 0 1 0 1 0 12 0
# ENSG00000000419 1843 1528 1520 1789 1144 1946 2017 2794 1455 2258
# ENSG00000000457 611 536 496 637 621 687 966 774 822 3026
# ENSG00000000460 453 493 884 1180 338 541 606 650 520 3479
# ENSG00000000938 249 296 995 113 1073 233 333 4441 2708 404
# ENSG00000000971 3570 1126 2431 1395 6452 7677 8222 1188 20762 4111
# ENSG00000001036 3774 1573 3323 1958 2029 2022 4236 1641 4195 1313
and want to select the following genes:
genes <- c("ENSG00000000003", "ENSG00000000460", "ENSG00000001084")
Why do I get an incorrect result when selecting the rows this way:
> df[factor(genes), ]
# 1 2 3 4 5 6 7 8 9 10
# ENSG00000000003 2407 2345 1052 2191 2542 812 3595 4215 1100 5457
# ENSG00000000005 0 5 0 0 1 0 1 0 12 0
# ENSG00000000419 1843 1528 1520 1789 1144 1946 2017 2794 1455 2258
and the correct result this way?
> df[as.vector(genes), ]
# 1 2 3 4 5 6 7 8 9 10
# ENSG00000000003 2407 2345 1052 2191 2542 812 3595 4215 1100 5457
# ENSG00000000460 453 493 884 1180 338 541 606 650 520 3479
# ENSG00000001084 3705 6465 1803 49162 2018 1161 4621 8359 3375 2678
The row names of df are strings, but in another data frame I have the same names stored as factors. To get correct results I have to wrap them in as.vector() every time.
Can you tell me the logic behind the first result?
Factors are stored internally as integer codes. When you subset the data frame with a factor, R indexes by those codes (here 1, 2 and 3) rather than by the labels, so it returns the first 3 rows of your data frame. Check:
(1:10)[factor(genes)]
#[1] 1 2 3
So from the sequence 1:10 it returns the first 3 values.
The same happens with data frames:
mtcars[factor(genes), ]
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21.0 6 160 110 3.90 2.62 16.5 0 1 4 4
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.88 17.0 0 1 4 4
#Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
If genes holds row names of your data frame as a character vector, you can subset directly (convert a factor with as.character() first):
df[genes, ]
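To make the difference concrete, here is a small sketch on a made-up data frame. The name toy and its count values are hypothetical; only the gene IDs come from the question.

```r
genes <- c("ENSG00000000003", "ENSG00000000460", "ENSG00000001084")
f <- factor(genes)

# The factor is stored as integer codes pointing into its levels
codes <- as.integer(f)   # 1 2 3

# Hypothetical data frame with four genes as row names
toy <- data.frame(
  count = c(10, 20, 30, 40),
  row.names = c("ENSG00000000003", "ENSG00000000005",
                "ENSG00000000460", "ENSG00000001084")
)

# Indexing with the factor uses the codes: rows 1, 2, 3
by_factor <- toy[f, , drop = FALSE]

# Converting to character matches on row names instead
by_name <- toy[as.character(f), , drop = FALSE]
```

by_factor holds the first three rows (counts 10, 20, 30), whereas by_name holds the three requested genes (counts 10, 30, 40).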
Can anyone help me find the approximate area under a curve using Riemann sums in R?
There does not seem to be a package in R that does this.
Sample data:
MNo1 X1 Y1 MNo2 X2 Y2
1 2981 -66287 1 595 -47797
1 2981 -66287 1 595 -47797
2 2973 -66087 2 541 -47597
2 2973 -66087 2 541 -47597
3 2963 -65887 3 485 -47397
3 2963 -65887 3 485 -47397
4 2952 -65687 4 430 -47197
4 2952 -65687 4 430 -47197
5 2942 -65486 5 375 -46998
5 2942 -65486 5 375 -46998
6 2935 -65286 6 322 -46798
6 2935 -65286 6 322 -46798
7 2932 -65086 7 270 -46598
7 2932 -65086 7 270 -46598
8 2936 -64886 8 222 -46398
8 2936 -64886 8 222 -46398
9 2948 -64685 9 176 -46198
9 2948 -64685 9 176 -46198
10 2968 -64485 10 135 -45999
10 2968 -64485 10 135 -45999
11 2998 -64284 11 97 -45799
11 2998 -64284 11 97 -45799
12 3035 -64084 12 65 -45599
12 3035 -64084 12 65 -45599
13 3077 -63883 13 37 -45399
13 3077 -63883 13 37 -45399
14 3122 -63683 14 14 -45199
14 3122 -63683 14 14 -45199
15 3168 -63482 15 -5 -44999
15 3168 -63482 15 -5 -44999
16 3212 -63282 16 -20 -44799
16 3212 -63282 16 -20 -44799
17 3250 -63081 17 -31 -44599
17 3250 -63081 17 -31 -44599
18 3280 -62881 18 -38 -44399
18 3280 -62881 18 -38 -44399
19 3301 -62680 19 -43 -44199
19 3301 -62680 19 -43 -44199
20 3313 -62480 20 -45 -43999
Check this demo, which uses the trapezoidal rule (the average of the left and right Riemann sums):
> library(zoo)
> x <- 1:10
> y <- -x^2
> Result <- sum(diff(x) * rollmean(y, 2))
> Result
[1] -334.5
After checking this question, I found that the function trapz() from the pracma package is more efficient:
> library(pracma)
> Result.2 <- trapz(x, y)
> Result.2
[1] -334.5
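Since the question asks for Riemann sums specifically, here is a minimal base R sketch of the left and right sums on the same demo data, with no packages required. It assumes x is sorted in increasing order; if it is not, reorder both vectors with order(x) first.

```r
x <- 1:10
y <- -x^2

dx <- diff(x)                     # widths of the sub-intervals

left  <- sum(dx * head(y, -1))    # heights taken at left endpoints:  -285
right <- sum(dx * tail(y, -1))    # heights taken at right endpoints: -384
trap  <- (left + right) / 2       # trapezoidal rule:                 -334.5
```

The trapezoidal value agrees with the zoo and pracma results above, which is expected because both compute the same quantity.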
I have a data frame below and I want to find the average row value for all columns with header *R and all columns with *G.
The output should then be four columns: Rfam, Classes, avg.rowR, avg.rowG
I was playing around with the rowMeans() function, but I am not sure how to specify the columns.
Rfam Classes 26G 26R 35G 35R 46G 46R 48G 48R 55G 55R
5_8S_rRNA rRNA 63 39 8 27 26 17 28 43 41 17
5S_rRNA rRNA 171 149 119 109 681 47 95 161 417 153
7SK 7SK 53 282 748 371 248 42 425 384 316 198
ACA64 Other 7 8 19 2 10 1 36 10 10 4
let-7 miRNA 121825 73207 25259 75080 54301 63510 30444 53800 78961 47533
lin-4 miRNA 10149 16263 5629 19680 11297 37866 3816 9677 11713 10068
Metazoa_SRP SRP 317 1629 1008 418 1205 407 1116 1225 1413 1075
mir-1 miRNA 3 4 1 2 0 26 1 1 0 4
mir-10 miRNA 912163 1411287 523793 1487160 517017 1466085 107597 551381 727720 788201
mir-101 miRNA 461 320 199 553 174 460 278 297 256 254
mir-103 miRNA 937 419 202 497 318 217 328 343 891 439
mir-1180 miRNA 110 32 4 17 53 47 6 29 35 22
mir-1226 miRNA 11 3 0 3 6 0 1 2 5 4
mir-1237 miRNA 3 2 1 1 0 1 0 2 1 1
mir-1249 miRNA 5 14 2 9 4 5 9 5 7 7
newcols <- sapply(c("R$", "G$"), function(x) rowMeans(df[grep(x, names(df))]))
setNames(cbind(df[1:2], newcols), c(names(df)[1:2], "avg.rowR", "avg.rowG"))
# Rfam Classes avg.rowR avg.rowG
# 1 5_8S_rRNA rRNA 28.6 33.2
# 2 5S_rRNA rRNA 123.8 296.6
# 3 7SK 7SK 255.4 358.0
# 4 ACA64 Other 5.0 16.4
# 5 let-7 miRNA 62626.0 62158.0
# 6 lin-4 miRNA 18710.8 8520.8
# 7 Metazoa_SRP SRP 950.8 1011.8
# 8 mir-1 miRNA 7.4 1.0
# 9 mir-10 miRNA 1140822.8 557658.0
# 10 mir-101 miRNA 376.8 273.6
# 11 mir-103 miRNA 383.0 535.2
# 12 mir-1180 miRNA 29.4 41.6
# 13 mir-1226 miRNA 2.4 4.6
# 14 mir-1237 miRNA 1.4 1.0
# 15 mir-1249 miRNA 8.0 5.4
One way to look for patterns in column names is to use the grep family of functions. The call grep("R$", names(df)) returns the indices of all columns whose names end with R. Using it inside sapply lets us handle the R and G columns in one expression.
The core of the second line is cbind(df[1:2], newcols), which binds the first two columns of df to the two new columns of mean values. Wrapping it in setNames(.., c(names(df)[1:2], ...)) sets the column names to match your desired output.
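The grep step can be checked in isolation. The sketch below uses only the first two rows and the first four count columns of the question's data (a cut-down subset chosen for brevity), and check.names = FALSE is an assumption so that names such as 26G are kept as they are:

```r
df <- data.frame(
  Rfam    = c("5_8S_rRNA", "5S_rRNA"),
  Classes = c("rRNA", "rRNA"),
  `26G` = c(63, 171), `26R` = c(39, 149),
  `35G` = c(8, 119),  `35R` = c(27, 109),
  check.names = FALSE
)

r_cols <- grep("R$", names(df))   # columns ending in R: 4 6
g_cols <- grep("G$", names(df))   # columns ending in G: 3 5

result <- data.frame(df[1:2],
                     avg.rowR = rowMeans(df[r_cols]),
                     avg.rowG = rowMeans(df[g_cols]))
result
```

Note that "Rfam" is not matched by "R$" because the pattern is anchored to the end of the name.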
dataset:
zip acs.pop napps pperct cgrp zgrp perc
1: 12007 97 2 2.0618557 2 1 25.000000
2: 12007 97 2 2.0618557 NA 2 50.000000
3: 12007 97 2 2.0618557 1 1 25.000000
4: 12008 485 2 0.4123711 2 1 33.333333
5: 12008 485 2 0.4123711 4 1 33.333333
6: 12008 485 2 0.4123711 NA 1 33.333333
7: 12009 7327 187 2.5522042 4 76 26.206897
8: 12009 7327 187 2.5522042 1 41 14.137931
9: 12009 7327 187 2.5522042 2 23 7.931034
10: 12009 7327 187 2.5522042 NA 103 35.517241
11: 12009 7327 187 2.5522042 3 47 16.206897
12: 12010 28802 580 2.0137490 NA 275 32.163743
13: 12010 28802 580 2.0137490 4 122 14.269006
14: 12010 28802 580 2.0137490 1 269 31.461988
15: 12010 28802 580 2.0137490 2 96 11.228070
16: 12010 28802 580 2.0137490 3 93 10.877193
17: 12018 7608 126 1.6561514 3 30 16.129032
18: 12018 7608 126 1.6561514 NA 60 32.258065
19: 12018 7608 126 1.6561514 2 14 7.526882
20: 12018 7608 126 1.6561514 4 57 30.645161
21: 12018 7608 126 1.6561514 1 25 13.440860
22: 12019 14841 144 0.9702850 NA 62 30.097087
23: 12019 14841 144 0.9702850 4 73 35.436893
24: 12019 14841 144 0.9702850 3 30 14.563107
25: 12019 14841 144 0.9702850 1 23 11.165049
26: 12019 14841 144 0.9702850 2 18 8.737864
27: 12020 31403 343 1.0922523 3 76 14.960630
28: 12020 31403 343 1.0922523 1 88 17.322835
29: 12020 31403 343 1.0922523 2 38 7.480315
30: 12020 31403 343 1.0922523 4 141 27.755906
31: 12020 31403 343 1.0922523 NA 165 32.480315
32: 12022 1002 5 0.4990020 NA 4 44.444444
33: 12022 1002 5 0.4990020 4 2 22.222222
34: 12022 1002 5 0.4990020 3 1 11.111111
35: 12022 1002 5 0.4990020 1 1 11.111111
I know the reshape2 or reshape package can handle this, but I'm not sure how. I need the final output to look like this:
zip acs.pop napps pperct zgrp4 zgrp3 zgrp2 zgrp1 perc4 perc3 perc2 perc1
12009 7327 187 2.5522042 76 47 23 41 26.206897 16.206897 7.931034 14.137931
zip is the id
acs.pop, napps, pperct will be the same for each zip group
zgrp4…zgrp1 are the values of zgrp for each value of cgrp
perc4…perc1 are the values of perc for each value of cgrp
We can try dcast from the devel version of data.table, which can take multiple value.var columns. In this case, 'zgrp' and 'perc' are the value columns. Using the grouping variables, we create a sequence variable ('ind') and then use dcast to convert from 'long' to 'wide' format.
Instructions to install the devel version are here
library(data.table)#v1.9.5
setDT(df1)[, ind:= 1:.N, .(zip, acs.pop, napps, pperct)]
dcast(df1, zip+acs.pop + napps+pperct~ind, value.var=c('zgrp', 'perc'))
# zip acs.pop napps pperct 1_zgrp 2_zgrp 3_zgrp 4_zgrp 5_zgrp 1_perc
#1: 12007 97 2 2.0618557 1 2 1 NA NA 25.00000
#2: 12008 485 2 0.4123711 1 1 1 NA NA 33.33333
#3: 12009 7327 187 2.5522042 76 41 23 103 47 26.20690
#4: 12010 28802 580 2.0137490 275 122 269 96 93 32.16374
#5: 12018 7608 126 1.6561514 30 60 14 57 25 16.12903
#6: 12019 14841 144 0.9702850 62 73 30 23 18 30.09709
#7: 12020 31403 343 1.0922523 76 88 38 141 165 14.96063
#8: 12022 1002 5 0.4990020 4 2 1 1 NA 44.44444
# 2_perc 3_perc 4_perc 5_perc
#1: 50.00000 25.000000 NA NA
#2: 33.33333 33.333333 NA NA
#3: 14.13793 7.931034 35.51724 16.206897
#4: 14.26901 31.461988 11.22807 10.877193
#5: 32.25807 7.526882 30.64516 13.440860
#6: 35.43689 14.563107 11.16505 8.737864
#7: 17.32284 7.480315 27.75591 32.480315
#8: 22.22222 11.111111 11.11111 NA
Or we can use ave/reshape from base R
df2 <- transform(df1, ind=ave(seq_along(zip), zip,
acs.pop, napps, pperct, FUN=seq_along))
reshape(df2, idvar=c('zip', 'acs.pop', 'napps', 'pperct'),
timevar='ind', direction='wide')
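Both approaches above number the rows within each zip using 'ind'. If the new columns should instead be keyed on cgrp, as in the desired output, a minimal base R sketch might look like the following. It uses only the zip 12009 rows from the question and assumes that rows with a missing cgrp can be dropped.

```r
# Rows for zip 12009, copied from the question
df1 <- data.frame(
  zip = 12009, acs.pop = 7327, napps = 187, pperct = 2.5522042,
  cgrp = c(4, 1, 2, NA, 3),
  zgrp = c(76, 41, 23, 103, 47),
  perc = c(26.206897, 14.137931, 7.931034, 35.517241, 16.206897)
)

# Drop rows without a cgrp, then spread zgrp and perc by cgrp
wide <- reshape(df1[!is.na(df1$cgrp), ],
                idvar = c("zip", "acs.pop", "napps", "pperct"),
                timevar = "cgrp", direction = "wide")
wide
```

The resulting columns are named zgrp.4, perc.4, zgrp.1, and so on, in the order the cgrp values first appear; they can be reordered or renamed afterwards if the exact layout matters.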
This is a good use for spread() in tidyr.
library(dplyr)
library(tidyr)
df2 <- df %>% filter(!is.na(cgrp)) %>% # if cgrp is missing I don't know where to put the obs
gather(Var, Val, 6:7) %>% # columns 6:7 are zgrp and perc; one row per measure observed
group_by(zip, acs.pop, napps, pperct) %>% # unique combos of these will define rows in output
unite(Var1, Var, cgrp) %>% # identify which obs belongs to which measure
spread(Var1, Val) # make columns for zgrp_1, _2, etc., perc_1, _2, etc.
Example output:
> df2[df2$zip==12009,]
Source: local data frame [1 x 12]
zip acs.pop napps pperct perc_1 perc_2 perc_3 perc_4 zgrp_1 zgrp_2 zgrp_3 zgrp_4
1 12009 7327 187 2.552204 14.13793 7.931034 16.2069 26.2069 41 23 47 76
Thanks to @akrun for the assist
I have data like this:
> bbT11
range X0 X1 total BR GDis BDis WOE IV Index
1 (1,23] 5718 194 5912 0.03281461 12.291488 8.009909 0.42822753 1.83348973 1.534535
2 (23,26] 5249 330 5579 0.05915039 11.283319 13.625103 -0.18858848 0.44163352 1.207544
3 (26,28] 3105 209 3314 0.06306578 6.674549 8.629232 -0.25685394 0.50206815 1.292856
4 (28,33] 6277 416 6693 0.06215449 13.493121 17.175888 -0.24132650 0.88874916 1.272937
5 (33,37] 4443 239 4682 0.05104656 9.550731 9.867878 -0.03266713 0.01036028 1.033207
6 (37,41] 4277 237 4514 0.05250332 9.193895 9.785301 -0.06234172 0.03686928 1.064326
7 (41,46] 4904 265 5169 0.05126717 10.541702 10.941371 -0.03721203 0.01487247 1.037913
8 (46,51] 4582 230 4812 0.04779717 9.849527 9.496284 0.03652287 0.01290145 1.037198
9 (51,57] 4039 197 4236 0.04650614 8.682287 8.133774 0.06526000 0.03579599 1.067437
10 (57,76] 3926 105 4031 0.02604813 8.439381 4.335260 0.66612734 2.73386708 1.946684
I need to add an additional column "Bin" that will show numbers from 1 to 10, depending on BR column being in descending order, so for example 10th row becomes first, then first row becomes second, etc.
Any help would be appreciated
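A very straightforward way is to use one of the rank functions from "dplyr" (e.g. dense_rank, min_rank). Here, I've actually just used rank from base R, which assigns 1 to the smallest BR and so matches the example in the question; use rank(-BR) to number from the largest BR down instead. I've deleted some columns below just for presentation purposes.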
library(dplyr)
mydf %>% mutate(bin = rank(BR))
# range X0 X1 total BR ... Index bin
# 1 (1,23] 5718 194 5912 0.03281461 ... 1.534535 2
# 2 (23,26] 5249 330 5579 0.05915039 ... 1.207544 8
# 3 (26,28] 3105 209 3314 0.06306578 ... 1.292856 10
# 4 (28,33] 6277 416 6693 0.06215449 ... 1.272937 9
# 5 (33,37] 4443 239 4682 0.05104656 ... 1.033207 5
# 6 (37,41] 4277 237 4514 0.05250332 ... 1.064326 7
# 7 (41,46] 4904 265 5169 0.05126717 ... 1.037913 6
# 8 (46,51] 4582 230 4812 0.04779717 ... 1.037198 4
# 9 (51,57] 4039 197 4236 0.04650614 ... 1.067437 3
# 10 (57,76] 3926 105 4031 0.02604813 ... 1.946684 1
If you just want to reorder the rows, use arrange instead:
mydf %>% arrange(BR)
Or, in base R, assign the positions 1 to n to the rows in order of BR:
bbT11$Bin[order(bbT11$BR)] <- seq_len(nrow(bbT11))
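As a quick check that the order()-based assignment and rank() agree, here is a small sketch that uses only the BR column from the question. The other columns are omitted for brevity, and Bin_desc is a hypothetical extra column showing the descending variant.

```r
bbT11 <- data.frame(
  BR = c(0.03281461, 0.05915039, 0.06306578, 0.06215449, 0.05104656,
         0.05250332, 0.05126717, 0.04779717, 0.04650614, 0.02604813)
)

# The row with the smallest BR gets 1, the next smallest gets 2, ...
bbT11$Bin[order(bbT11$BR)] <- seq_len(nrow(bbT11))

# Same numbering via rank(); negate BR to number from the largest down
same_as_rank    <- all(bbT11$Bin == rank(bbT11$BR))
bbT11$Bin_desc  <- rank(-bbT11$BR)

bbT11$Bin   # 2 8 10 9 5 7 6 4 3 1
```

With no ties in BR the two approaches give identical results; with ties, rank() averages the tied positions by default, while the order() assignment breaks ties by row position.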