I am sure this has a simple answer, but I am struggling to add a column that takes two different values to my dataframe. Currently, this is what it looks like:
vcv.index model.index par.index grid index estimate se lcl ucl fixed
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157
5 10 10 20 A 20 0.7575811 0.05033490 0.6461758 0.8424612
6 21 21 61 B 61 0.8713467 0.07638687 0.6404598 0.9626184
7 22 22 62 B 62 0.6074379 0.06881230 0.4677827 0.7314827
8 23 23 63 B 63 0.6041054 0.06107520 0.4805279 0.7156792
9 24 24 64 B 64 0.5806565 0.06927308 0.4422237 0.7074601
10 25 25 65 B 65 0.7370944 0.05892108 0.6070620 0.8357394
11 41 41 121 C 121 0.8048479 0.09684385 0.5519097 0.9324759
12 42 42 122 C 122 0.5259547 0.07165218 0.3871380 0.6608721
13 43 43 123 C 123 0.5427100 0.07127273 0.4033255 0.6757137
14 44 44 124 C 124 0.5168820 0.06156392 0.3975561 0.6343132
15 45 45 125 C 125 0.6550049 0.07378403 0.5002851 0.7826343
16 196 196 586 A 586 0.8536314 0.08709394 0.5979992 0.9580976
17 197 197 587 A 587 0.5672194 0.07079508 0.4268452 0.6975725
18 198 198 588 A 588 0.5675415 0.06380445 0.4408540 0.6859714
19 199 199 589 A 589 0.5666874 0.06499899 0.4377071 0.6872233
20 200 200 590 A 590 0.7058542 0.05985868 0.5769484 0.8085177
21 211 211 631 B 631 0.8360614 0.09413427 0.5703031 0.9514472
22 212 212 632 B 632 0.5432872 0.07906200 0.3891364 0.6895701
23 213 213 633 B 633 0.5400994 0.06497607 0.4129055 0.6622759
24 214 214 634 B 634 0.5161692 0.06292706 0.3943257 0.6361202
25 215 215 635 B 635 0.6821667 0.07280044 0.5263841 0.8056298
26 226 226 676 C 676 0.7621875 0.10484478 0.5077465 0.9087471
27 227 227 677 C 677 0.4607440 0.07326970 0.3240229 0.6036386
28 228 228 678 C 678 0.4775168 0.08336433 0.3219349 0.6375872
29 229 229 679 C 679 0.4517655 0.06393339 0.3319262 0.5774725
30 230 230 680 C 680 0.5944330 0.07210672 0.4491995 0.7248303
Then I add a column with periods 1-5, repeated until it reaches the end of the dataframe, with this code:
SurJagPred$estimates %<>% mutate(Primary = rep(1:5, 6))
I also need to add sex (F, M). Rows 1-15 are female and rows 16-30 are male. So overall it should look like this:
vcv.index model.index par.index grid index estimate se lcl ucl fixed Primary Sex
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751 1 F
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014 2 F
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169 3 F
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157 4 F
We can use rep() with the each argument to repeat every element of a vector a given number of times. Here each of "F" and "M" is repeated 15 times:
SurJagPred$estimates %<>%
mutate(Sex = rep(c("F", "M"), each = 15))
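The difference between the two rep() calls used above can be seen on a small stand-in data frame. This is only a sketch in base R: the data frame `est` and its `estimate` values are made up for illustration, and it assumes the rows are already ordered with all females first.

```r
# Hypothetical stand-in for SurJagPred$estimates: 30 rows, females first
est <- data.frame(estimate = seq(0.1, 3, length.out = 30))

# times = 6 repeats the whole vector 1:5 six times: 1 2 3 4 5 1 2 3 4 5 ...
est$Primary <- rep(1:5, times = 6)

# each = 15 repeats every element 15 times in a row: F F ... F M M ... M
est$Sex <- rep(c("F", "M"), each = 15)

est[c(1, 5, 6, 15, 16, 30), ]
```

If the number of rows may change, `each = nrow(est) / 2` avoids hard-coding 15, provided the two sexes have the same number of rows.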
I have a dataframe containing location data of different animals. Each animal has a unique id and each observation has a time stamp and some further metrics of the location observation. See a subset of the data below. The subset contains the first two observations of each id.
> sub
id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
2 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
3 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
4 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
5 333 B -80.8211 24.8441 11625 6980 37 2018-12-17 20:45:05
6 333 3 -80.8137 24.8263 155 100 69 2018-12-17 21:00:43
7 444 3 -80.4535 25.0848 501 33 104 2019-10-20 19:44:16
8 444 1 -80.8086 24.8364 6356 126 87 2020-01-18 20:32:28
9 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17
10 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35
11 666 2 -77.7221 24.4902 1129 75 66 2020-07-12 21:09:02
12 666 2 -77.7097 24.4905 314 248 164 2020-07-12 21:11:37
13 777 3 -77.7133 24.4820 406 58 110 2020-06-20 11:18:18
14 777 3 -77.7218 24.4844 170 93 107 2020-06-20 11:51:06
15 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
16 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
However, I need to do some data housekeeping, i.e. I need to include the day/time and location each animal was released. And after that I need to filter out observations for each animal that occurred pre-release of the corresponding animal.
I have an additional dataframe that contains the necessary release metadata:
> stack
id release lat lon
1 888 2017-11-27 14:53 25.69201 -79.31534
2 333 2019-01-31 16:09 25.68896 -79.31326
3 222 2019-02-02 15:55 25.70051 -79.31393
4 111 2019-04-02 10:43 25.68534 -79.31341
5 444 2020-03-13 15:04 24.42892 -77.69518
6 666 2020-10-27 09:40 24.58290 -77.69561
7 555 2020-01-21 14:38 24.43333 -77.69637
8 777 2020-06-25 08:54 24.42712 -77.76427
So my question is: how can I add the release information (time and lat/lon) to the dataframe for each id (the columns a, b, and c can be NA)? And how can I then filter out the observations that occurred before each animal's release time? I have been looking into possibilities using dplyr but have not yet been able to resolve my issue.
You've not provided an easy way of obtaining your data (dput() is by far the best), and your date-time values are inconsistent (release uses Y-M-D H:M whereas date uses Y-M-D H:M:S), so for clarity I've included the code to create the data frames I use at the end of this post.
First, the solution:
library(tidyverse)
library(lubridate)
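sub %>%
  # attach each animal's release time and location by id
  left_join(stack, by = "id") %>%
  mutate(
    # release has no seconds, so append ":00" before parsing
    release = ymd_hms(paste0(release, ":00")),
    date = ymd_hms(date)
  ) %>%
  # keep only observations at or after the release time
  filter(date >= release)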
id lc lon.x lat.x a b c date release lat.y lon.y
1 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17 2020-01-21 14:38:00 24.43333 -77.69637
2 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35 2020-01-21 14:38:00 24.43333 -77.69637
As I indicated in comments.
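For comparison, the same join-then-filter idea can be sketched in base R. This swaps merge() and as.POSIXct() in for left_join() and ymd_hms(), so it is an alternative rather than the code above. The data frames `obs` and `rel` are cut-down copies of `sub` and `stack`, the column names `release_lat` and `release_lon` are my own choice to avoid the `.x`/`.y` suffixes, and the UTC time zone is an assumption.

```r
# Cut-down copies of the question's data (two animals only)
obs <- data.frame(
  id   = c(555, 555, 777, 777),
  date = c("2020-07-12 21:09:17", "2020-07-12 21:10:35",
           "2020-06-20 11:18:18", "2020-06-20 11:51:06")
)
rel <- data.frame(
  id      = c(555, 777),
  release = c("2020-01-21 14:38", "2020-06-25 08:54"),
  lat     = c(24.43333, 24.42712),
  lon     = c(-77.69637, -77.76427)
)

# Rename the release coordinates so they cannot clash with observation columns
names(rel)[names(rel) %in% c("lat", "lon")] <- c("release_lat", "release_lon")

# Left join: keep every observation, add the release metadata by id
joined <- merge(obs, rel, by = "id", all.x = TRUE)

# Parse both timestamps; the time zone is assumed to be UTC here
joined$date    <- as.POSIXct(joined$date,    format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
joined$release <- as.POSIXct(joined$release, format = "%Y-%m-%d %H:%M",    tz = "UTC")

# Keep only observations made at or after the release
post_release <- joined[joined$date >= joined$release, ]
post_release
```

With this subset only the two rows for id 555 remain, because both observations for 777 were made before its release.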
To obtain the data
sub <- read.table(textConnection("id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
2 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
3 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
4 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
5 333 B -80.8211 24.8441 11625 6980 37 '2018-12-17 20:45:05'
6 333 3 -80.8137 24.8263 155 100 69 '2018-12-17 21:00:43'
7 444 3 -80.4535 25.0848 501 33 104 '2019-10-20 19:44:16'
8 444 1 -80.8086 24.8364 6356 126 87 '2020-01-18 20:32:28'
9 555 3 -77.7211 24.4887 665 45 68 '2020-07-12 21:09:17'
10 555 3 -77.7163 24.4897 285 129 130 '2020-07-12 21:10:35'
11 666 2 -77.7221 24.4902 1129 75 66 '2020-07-12 21:09:02'
12 666 2 -77.7097 24.4905 314 248 164 '2020-07-12 21:11:37'
13 777 3 -77.7133 24.4820 406 58 110 '2020-06-20 11:18:18'
14 777 3 -77.7218 24.4844 170 93 107 '2020-06-20 11:51:06'
15 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'
16 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'"), header=TRUE)
stack <- read.table(textConnection("id release lat lon
1 888 '2017-11-27 14:53' 25.69201 -79.31534
2 333 '2019-01-31 16:09' 25.68896 -79.31326
3 222 '2019-02-02 15:55' 25.70051 -79.31393
4 111 '2019-04-02 10:43' 25.68534 -79.31341
5 444 '2020-03-13 15:04' 24.42892 -77.69518
6 666 '2020-10-27 09:40' 24.58290 -77.69561
7 555 '2020-01-21 14:38' 24.43333 -77.69637
8 777 '2020-06-25 08:54' 24.42712 -77.76427"), header=TRUE)
I am using code based on DESeq2. One of my goals is to plot a heatmap of the data.
heatmap.data <- counts(dds)[topGenes,]
The error I am getting is
Error in counts(dds)[topGenes, ]: subscript out of bounds
The first few lines of my counts(dds) output look like this:
99h1 99h2 99h3 99h4 wth1 wth2
ENSDARG00000000002 243 196 187 117 91 96
ENSDARG00000000018 42 55 53 32 48 48
ENSDARG00000000019 91 91 108 64 95 94
ENSDARG00000000068 3 10 10 10 30 21
ENSDARG00000000069 55 47 43 53 51 30
ENSDARG00000000086 46 26 36 18 37 29
ENSDARG00000000103 301 289 289 199 347 386
ENSDARG00000000151 18 19 17 14 22 19
ENSDARG00000000161 16 17 9 19 10 20
ENSDARG00000000175 10 9 10 6 16 12
ENSDARG00000000183 12 8 15 11 8 9
ENSDARG00000000189 16 17 13 10 13 21
ENSDARG00000000212 227 208 259 234 78 69
ENSDARG00000000229 68 72 95 44 71 64
ENSDARG00000000241 71 92 67 76 88 74
ENSDARG00000000324 11 9 6 2 8 9
ENSDARG00000000370 12 5 7 8 0 5
ENSDARG00000000394 390 356 339 283 313 286
ENSDARG00000000423 0 0 2 2 7 1
ENSDARG00000000442 1 1 0 0 1 1
ENSDARG00000000472 16 8 3 5 7 8
ENSDARG00000000476 2 1 2 4 6 3
ENSDARG00000000489 221 203 169 144 84 114
ENSDARG00000000503 133 118 139 89 91 112
ENSDARG00000000529 31 25 17 26 15 24
ENSDARG00000000540 25 17 17 10 28 19
ENSDARG00000000542 15 9 9 6 15 12
How do I ensure all the elements of the top genes are present in it?
When I print the top 20 genes, I get a list like the following, with the column header V1:
[1] "6339" "12416" "1241" "3025" "12791" "846" "15090"
[8] "6529" "14564" "4863" "12777" "1122" "7454" "13716"
[15] "5790" "3328" "1231" "13734" "2797" "9072"
I have used both
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = TRUE)
and
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = FALSE)
to see whether the out-of-bounds error goes away, but neither helped. I suspect the V1 header is causing the issue.
The top genes list was generated using the code snippet below.
resordered <- res[order(res$padj),]
#Reorder gene list by increasing pAdj
resordered <- as.data.frame(res[order(res$padj),])
#Filter for genes that are differentially expressed with an FDR < 0.01
ii <- which(res$padj < 0.01)
length(ii)
# Use the rownames() function to get the top 20 differentially expressed genes from our results table
topGenes <- rownames(resordered[1:20,])
topGenes
# Get the counts from the DESeqDataSet using the counts() function
heatmap.data <- counts(dds)[topGenes,]
Perhaps this will do what you want?
counts_dds <- counts(dds)
topgenes <- c("ENSDARG00000000002", "ENSDARG00000000489", "ENSDARG00000000503",
"ENSDARG00000000540", "ENSDARG00000000529", "ENSDARG00000000542")
heatmap.data <- counts_dds[rownames(counts_dds) %in% topgenes,]
If you provide more information it will be easier to advise you on how to fix your problem.
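One way to check whether every entry of the top-gene list is actually a row name of the count matrix is to compare the two with setdiff() and intersect(). The sketch below uses a small plain matrix instead of a real DESeqDataSet; `counts_mat` and `top_genes` are hypothetical names, and the counts are a few values copied from the question. Indexing a matrix with a character value that is not one of its row names is what produces "subscript out of bounds".

```r
# Stand-in for counts(dds): a plain matrix with gene IDs as row names
counts_mat <- matrix(
  c(243, 196,
     42,  55,
     91,  91),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("ENSDARG00000000002", "ENSDARG00000000018", "ENSDARG00000000019"),
    c("99h1", "99h2")
  )
)

# Hypothetical top-gene list: the second entry is not a row name
top_genes <- c("ENSDARG00000000018", "6339")

# Entries that would trigger "subscript out of bounds"
missing <- setdiff(top_genes, rownames(counts_mat))
missing

# Subset using only the entries that are present
present <- intersect(top_genes, rownames(counts_mat))
heatmap_data <- counts_mat[present, , drop = FALSE]
heatmap_data
```

If `missing` contains values such as "6339", the list probably holds something other than gene IDs (row numbers are one possibility), and it would need to be mapped back to gene IDs before subsetting. Note also that read.table() returns a data frame, so the gene column has to be extracted (for example `topGenes$V1`) before it is used as an index.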
I need to duplicate those levels whose frequency in my factor variable called groups is less than 500.
> head(groups)
[1] 0000000 1000000 1000000 1000000 0000000 0000000
75 Levels: 0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 ... 1111110
For example:
> table(group)
group
0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 0011000 0011010 0011100
58674 6 1033 654 223 1232 31 222 17 818 132 32 15 42 9 9
0011110 0100000 0100001 0100010 0100100 0100101 0100110 0101000 0101010 0101100 0101110 0110000 0110010 0110100 0110110 0111000
1 10609 1 487 64 1 58 132 11 12 3 142 27 9 7 11
0111010 0111100 0111110 1000000 1000001 1000010 1000011 1000100 1000101 1000110 1001000 1001001 1001010 1001100 1001110 1010000
5 1 2 54245 10 1005 1 329 1 138 573 1 31 71 11 969
1010010 1010100 1010110 1011000 1011010 1011100 1011110 1100000 1100001 1100010 1100011 1100100 1100110 1101000 1101010 1101011
147 29 21 63 15 10 4 14161 6 770 1 142 96 260 23 1
1101100 1101110 1110000 1110001 1110010 1110100 1110110 1111000 1111010 1111100 1111110
34 16 439 2 103 13 26 36 13 8 5
Groups 0000001, 0000110, 0001010, 0001100... must be duplicated up to 500.
Ideally, I would end up with a balanced sample: levels with fewer than 500 observations are duplicated up to 500, and levels with more than 500 observations are reduced to 500.
We can use rep on the levels of 'group' for the desired 'n'
factor(rep(levels(group), each = n))
If we need to use the table results as well
factor(rep(levels(group), table(group) + n-table(group)) )
Or with pmax, which keeps the larger of n and each level's observed count
factor(rep(levels(group), pmax(n, table(group))))
data
set.seed(24)
group <- factor(sample(letters[1:6], 3000, replace = TRUE))
n <- 500
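If the factor is a column of a data frame and whole rows need to be balanced, one possible approach is to resample row indices per level: draw with replacement when a level has fewer than n rows and without replacement otherwise. This is a sketch of that idea rather than the rep() approach above. The data frame `dat`, its `value` column and the small `n` are invented for illustration, and it assumes every level occurs at least once (droplevels() would remove empty levels first).

```r
set.seed(24)  # only so the sketch is reproducible
n <- 5        # target rows per level (500 in the question)

# Hypothetical unbalanced data: 2 rows of "a", 9 of "b", 5 of "c"
dat <- data.frame(
  group = factor(c(rep("a", 2), rep("b", 9), rep("c", 5))),
  value = seq_len(16)
)

# Row indices belonging to each level
idx_by_level <- split(seq_len(nrow(dat)), dat$group)

# Draw exactly n indices per level; replacement is used only for small levels
picked <- lapply(idx_by_level, function(idx) {
  idx[sample.int(length(idx), n, replace = length(idx) < n)]
})

balanced <- dat[unlist(picked), ]
table(balanced$group)
```

Sampling positions with sample.int() rather than calling sample() on the indices directly avoids the surprise where sample(x, ...) treats a single number x as the range 1:x.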
I have a data frame below and I want to find the average row value for all columns with header *R and all columns with *G.
The output should then be four columns: Rfam, Classes, avg.rowR, avg.rowG
I was playing around with the rowMeans() function, but I am not sure how to specify the columns.
Rfam Classes 26G 26R 35G 35R 46G 46R 48G 48R 55G 55R
5_8S_rRNA rRNA 63 39 8 27 26 17 28 43 41 17
5S_rRNA rRNA 171 149 119 109 681 47 95 161 417 153
7SK 7SK 53 282 748 371 248 42 425 384 316 198
ACA64 Other 7 8 19 2 10 1 36 10 10 4
let-7 miRNA 121825 73207 25259 75080 54301 63510 30444 53800 78961 47533
lin-4 miRNA 10149 16263 5629 19680 11297 37866 3816 9677 11713 10068
Metazoa_SRP SRP 317 1629 1008 418 1205 407 1116 1225 1413 1075
mir-1 miRNA 3 4 1 2 0 26 1 1 0 4
mir-10 miRNA 912163 1411287 523793 1487160 517017 1466085 107597 551381 727720 788201
mir-101 miRNA 461 320 199 553 174 460 278 297 256 254
mir-103 miRNA 937 419 202 497 318 217 328 343 891 439
mir-1180 miRNA 110 32 4 17 53 47 6 29 35 22
mir-1226 miRNA 11 3 0 3 6 0 1 2 5 4
mir-1237 miRNA 3 2 1 1 0 1 0 2 1 1
mir-1249 miRNA 5 14 2 9 4 5 9 5 7 7
newcols <- sapply(c("R$", "G$"), function(x) rowMeans(df[grep(x, names(df))]))
setNames(cbind(df[1:2], newcols), c(names(df)[1:2], "avg.rowR", "avg.rowG"))
# Rfam Classes avg.rowR avg.rowG
# 1 5_8S_rRNA rRNA 28.6 33.2
# 2 5S_rRNA rRNA 123.8 296.6
# 3 7SK 7SK 255.4 358.0
# 4 ACA64 Other 5.0 16.4
# 5 let-7 miRNA 62626.0 62158.0
# 6 lin-4 miRNA 18710.8 8520.8
# 7 Metazoa_SRP SRP 950.8 1011.8
# 8 mir-1 miRNA 7.4 1.0
# 9 mir-10 miRNA 1140822.8 557658.0
# 10 mir-101 miRNA 376.8 273.6
# 11 mir-103 miRNA 383.0 535.2
# 12 mir-1180 miRNA 29.4 41.6
# 13 mir-1226 miRNA 2.4 4.6
# 14 mir-1237 miRNA 1.4 1.0
# 15 mir-1249 miRNA 8.0 5.4
One way to look for patterns in column names is to use the grep family of functions. The function call grep("R$", names(df)) will return the index of all column names that end with R. When we use it with sapply we can search for the R and G columns in one expression.
The core of the second line is cbind(df[1:2], newcols). That binds the first two columns of df to the two new columns of mean values. Wrapping it with setNames(..., c(names(df)[1:2], ...)) sets the column names to match your desired output.
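The column selection can be seen step by step on a cut-down version of the table. This is a minimal sketch: the data frame below holds only two rows and four sample columns copied from the question, and `check.names = FALSE` is used so that names starting with a digit are kept as they are.

```r
# Two rows and four sample columns taken from the question's table
df <- data.frame(
  Rfam    = c("5_8S_rRNA", "ACA64"),
  Classes = c("rRNA", "Other"),
  `26G` = c(63, 7),  `26R` = c(39, 8),
  `35G` = c(8, 19),  `35R` = c(27, 2),
  check.names = FALSE
)

# Positions of the columns whose names end in R or in G
r_cols <- grep("R$", names(df))
g_cols <- grep("G$", names(df))

# Row means over each set of columns, next to the two identifier columns
res <- data.frame(
  df[1:2],
  avg.rowR = rowMeans(df[r_cols]),
  avg.rowG = rowMeans(df[g_cols])
)
res
```

The `$` in the pattern anchors the match to the end of the name, which is why Rfam is not picked up by "R$".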
I have a dataset in a given format:
USER.ID avgfrequency
1 3 3.7821782
2 7 14.7500000
3 9 13.4761905
4 13 5.1967213
5 16 6.7812500
6 26 41.7500000
7 49 13.6666667
8 50 7.0000000
9 51 1.0000000
10 52 17.7500000
11 69 4.5000000
12 75 9.9500000
13 91 84.2000000
14 98 8.0185185
15 138 14.2000000
16 139 34.7500000
17 149 7.6666667
18 155 35.3333333
19 167 24.0000000
20 170 7.3529412
21 171 4.4210526
22 175 6.5781250
23 176 19.2857143
24 177 10.4864865
25 178 28.0000000
26 180 4.8461538
27 183 25.5000000
28 184 13.0000000
29 210 32.0000000
30 215 13.4615385
31 220 11.3611111
32 223 26.2500000
I want to first sort the dataset by avgfrequency and then I want to plot count of USER.ID's that fall under different bin categories.
I want to divide avgfrequency into different bin categories of width 10.
I am trying to sort data using:
user_avgfrequency <- user_avgfrequency[order(user_avgfrequency[,1]), ]
but getting an error.
df <- data.frame(USER.ID=c(3,7,9,13,16,26,49,50,51,52,69,75,91,98,138,139,149,155,167,170,171,175,176,177,178,180,183,184,210,215,220,223), avgfrequency=c(3.7821782,14.7500000,13.4761905,5.1967213,6.7812500,41.7500000,13.6666667,7.0000000,1.0000000,17.7500000,4.5000000,9.9500000,84.2000000,8.0185185,14.2000000,34.7500000,7.6666667,35.3333333,24.0000000,7.3529412,4.4210526,6.5781250,19.2857143,10.4864865,28.0000000,4.8461538,25.5000000,13.0000000,32.0000000,13.4615385,11.3611111,26.2500000) );
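# Bin edges of width 10, from 0 up to the next multiple of 10 above the maximum
breaks <- seq(0, ceiling(max(df$avgfrequency) / 10) * 10, 10)
# One colour per bin
cols <- colorRampPalette(c('blue', 'green', 'red'))(length(breaks) - 1)
hist(df$avgfrequency, breaks, col = cols, axes = FALSE, xlab = 'Average Frequency', ylab = 'Count')
# Draw the axes by hand so the ticks line up with the bin edges and whole counts
axis(1, breaks)
axis(2, 0:max(tabulate(cut(df$avgfrequency, breaks))))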
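The sorting and counting steps can also be done without a plot. Note that the attempt in the question orders by column 1, which is USER.ID rather than avgfrequency. The sketch below uses a five-row subset of the data with rounded values, orders by the avgfrequency column by name, and counts users per bin with cut() and table().

```r
# Small subset of the question's data (values rounded for brevity)
df <- data.frame(
  USER.ID      = c(3, 7, 9, 13, 91),
  avgfrequency = c(3.78, 14.75, 13.48, 5.20, 84.2)
)

# Sort by avgfrequency (ascending), referring to the column by name
sorted <- df[order(df$avgfrequency), ]

# Bins of width 10 covering the whole range
breaks <- seq(0, ceiling(max(df$avgfrequency) / 10) * 10, by = 10)

# Assign each value to a bin such as (0,10] and count users per bin
bins   <- cut(sorted$avgfrequency, breaks)
counts <- table(bins)
counts
```

By default cut() builds intervals that are open on the left and closed on the right, so a value of exactly 10 falls into (0,10]. Passing `right = FALSE` gives [0,10) style bins instead.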