Challenge in data manipulation - R

I am trying to process a database of BLAST output to generate a data frame containing values for a given gene and given sample. When a gene is identified within a sample I would like the scaffold on which it was identified to be reported. If a given gene is NOT identified within a given sample I would like the cell to be filled with N/A.
sample_name scaffold gene_title match_(%)
P24_ST48 64 aadA12 94.56
401B_ST5223 381 blaTEM-163 99.65
P32_ST218 91 aadA24 90.41
HOS66_ST73 9 blaACT-5 72.31
HOS16_ST38 70 blaTEM-146 99.42
HOS56_ST131 48 aadA21 91.39
Ecoli_2009_1_ST131 41 sul1 99.88
PH152_ST95 37 dfrA33 83.94
Ecoli_2009_32_STNT 16 aac(3)-Ib 100.00
PH231_ST38 59 mph(D) 89.83
P44_STNT 135 blaTEM-105 99.88
Ecoli_2011_89_ST127 29 blaTEM-158 99.65
405C_ST1178 120 aadA1 99.75
P3_STNT 15 blaTEM-68 99.19
5A_ST34 174 blaTEM-127 99.88
P27_ST10 211 aph(3')-Ia 100.00
4D_ST767 393 blaTEM-152 98.95
P10_STNT 23 blaTEM-17 99.07
Ecoli_2014_27_ST131 49 sul2_15 99.88
Ecoli_2013_10_ST73 23 blaTEM-2 99.19
The output table would look something like:
Sample aadA1 aadA12 aadA24 blaTEM-163 ...
P24_ST48 N/A 64 N/A N/A
401B_ST5223 N/A N/A N/A 381
...
In Excel I have concatenated the sample name and gene title and used VLOOKUP to report the scaffold number on the row where this string is found. I have tried many different approaches in R and am going around in circles.
Now that I am processing 700+ genes and 450+ samples, the list of gene-sample combinations has become too laborious for Excel to manage, and with my collection of samples growing I need another solution.
Any help would be greatly appreciated.
Cheers,
Max

Here's how to do that with spread from tidyr
library(tidyr)
df1 %>%
  spread(key = gene_title, value = scaffold)
sample_name match_... aac(3)-Ib aadA1 aadA12 ...
1 401B_ST5223 99.65 NA NA NA
2 405C_ST1178 99.75 NA 120 NA
3 4D_ST767 98.95 NA NA NA
4 5A_ST34 99.88 NA NA NA
5 Ecoli_2009_1_ST131 99.88 NA NA NA
...
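As a side note, in recent tidyr versions (1.0.0 and later) spread is superseded by pivot_wider. A roughly equivalent sketch, assuming the same df1 as read in under Data below; restricting id_cols to sample_name gives one row per sample, which is closer to the requested layout than the spread result above (which carries match_... along):
library(tidyr)
pivot_wider(df1, id_cols = sample_name,
            names_from = gene_title, values_from = scaffold)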
Data
df1 <- read.table(text="sample_name scaffold gene_title match_(%)
P24_ST48 64 aadA12 94.56
401B_ST5223 381 blaTEM-163 99.65
P32_ST218 91 aadA24 90.41
HOS66_ST73 9 blaACT-5 72.31
HOS16_ST38 70 blaTEM-146 99.42
HOS56_ST131 48 aadA21 91.39
Ecoli_2009_1_ST131 41 sul1 99.88
PH152_ST95 37 dfrA33 83.94
Ecoli_2009_32_STNT 16 aac(3)-Ib 100.00
PH231_ST38 59 mph(D) 89.83
P44_STNT 135 blaTEM-105 99.88
Ecoli_2011_89_ST127 29 blaTEM-158 99.65
405C_ST1178 120 aadA1 99.75
P3_STNT 15 blaTEM-68 99.19
5A_ST34 174 blaTEM-127 99.88
P27_ST10 211 aph(3')-Ia 100.00
4D_ST767 393 blaTEM-152 98.95
P10_STNT 23 blaTEM-17 99.07
Ecoli_2014_27_ST131 49 sul2_15 99.88
Ecoli_2013_10_ST73 23 blaTEM-2 99.19",
header=TRUE,stringsAsFactors=FALSE)

We can use dcast from data.table
library(data.table)
dcast(setDT(df1), sample_name + match_... ~ gene_title, value.var = 'scaffold')
# sample_name match_... aac(3)-Ib aadA1 aadA12 ...
#1: 401B_ST5223 99.65 NA NA
#2: 405C_ST1178 99.75 NA 120
#3: 4D_ST767 98.95 NA NA
#4: 5A_ST34 99.88 NA NA
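If you only need one row per sample with the scaffold numbers, and don't want match_... carried along, a sketch that simply drops it from the left-hand side of the formula (same df1 as above):
library(data.table)
dcast(setDT(df1), sample_name ~ gene_title, value.var = 'scaffold')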

Related

Combine two similar columns in R

I'm trying to combine two columns of data that essentially contain the same information, but each column is missing some values that the other one has. Column "wasiIQw1" holds the data for half of the group, while column "w1iq" holds the data for the other half of the group.
select(gadd.us,nidaid,wasiIQw1,w1iq)[1:10,]
nidaid wasiIQw1 w1iq
1 45-D11150341 104 NA
2 45-D11180321 82 NA
3 45-D11220022 93 93
4 45-D11240432 118 NA
5 45-D11270422 99 NA
6 45-D11290422 82 82
7 45-D11320321 99 99
8 45-D11500021 99 99
9 45-D11500311 95 95
10 45-D11520011 111 111
select(gadd.us,nidaid,wasiIQw1,w1iq)[384:394,]
nidaid wasiIQw1 w1iq
384 H1900442S NA 62
385 H1930422S NA 83
386 H1960012S NA 89
387 H1960321S NA 90
388 H2020011S NA 96
389 H2020422S NA 102
390 H2040011S NA 102
391 H2040331S NA 94
392 H2040422S NA 103
393 H2050051S NA 86
394 H2050341S NA 98
With the following code I joined df.a (a data frame with the id and wasiIQw1) with df.b (a data frame with the id and w1iq) and got the following results.
df.join <- semi_join(df.a,
                     df.b,
                     by = "nidaid")
nidaid w1iq
1 45-D11150341 NA
2 45-D11180321 NA
3 45-D11220022 93
4 45-D11240432 NA
5 45-D11270422 NA
6 45-D11290422 82
7 45-D11320321 99
8 45-D11500021 99
9 45-D11500311 95
10 45-D11520011 111
nidaid w1iq
384 H1900442S 62
385 H1930422S 83
386 H1960012S 89
387 H1960321S 90
388 H2020011S 96
389 H2020422S 102
390 H2040011S 102
391 H2040331S 94
392 H2040422S 103
393 H2050051S 86
394 H2050341S 98
All of this works except for the first four "NA"s that won't merge. Other "_join" functions from dplyr have not worked either. Do you have any tips for combining these two columns so that no data is lost and every "NA" is filled in whenever the other column has a value present?
I guess you can use coalesce here which finds the first non-missing value at each position.
library(dplyr)
gadd.us %>% mutate(w1iq = coalesce(w1iq, wasiIQw1))
This selects the value from w1iq if it is present; if w1iq is NA, it takes the value from wasiIQw1 instead. You can switch the positions of w1iq and wasiIQw1 if you want to give priority to wasiIQw1.
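For illustration, a minimal sketch on toy data (hypothetical values, not taken from the question):
library(dplyr)
toy <- data.frame(nidaid   = paste0("H", 1:4),
                  wasiIQw1 = c(104, NA, 93, NA),
                  w1iq     = c(NA, 82, 93, 96))
# coalesce() returns the first non-missing value across its arguments, element-wise
toy %>% mutate(iq = coalesce(w1iq, wasiIQw1))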
Here would be a way to do it with base R (no packages)
Create reproducible data:
> dat<-data.frame(nidaid=paste0("H",c(1:5)), wasiIQw1=c(NA,NA,NA,75,9), w1iq=c(44,21,46,75,NA))
>
> dat
nidaid wasiIQw1 w1iq
1 H1 NA 44
2 H2 NA 21
3 H3 NA 46
4 H4 75 75
5 H5 9 NA
Create a new column named new to combine the two. With this ifelse statement, we say: if the first column wasiIQw1 is not (!) NA (is.na()), grab it; otherwise grab the second column. Similar to Ronak's answer, you can switch the column names here to give one preference over the other.
> dat$new<-ifelse(!is.na(dat$wasiIQw1), dat$wasiIQw1, dat$w1iq)
>
> dat
nidaid wasiIQw1 w1iq new
1 H1 NA 44 44
2 H2 NA 21 21
3 H3 NA 46 46
4 H4 75 75 75
5 H5 9 NA 9
Using base R, we can do
gadd.us$w1iq <- with(gadd.us, pmax(w1iq, wasiIQw1, na.rm = TRUE))
(This works here because whenever both columns are non-missing they hold the same value; pmax keeps the larger of the two, so it is not a general substitute for coalesce.)

How to sum rows by group in a big data frame?

My data (crsp.daily) look roughly like this (the numbers are made up and there are more variables):
PERMCO PERMNO date price VOL SHROUT
103 201 19951006 8.8 100 823
103 203 19951006 7.9 200 1002
1004 10 19951006 5 277 398
2 5 19951110 5.3 NA 579
1003 2 19970303 10 67 NA
1003 1 19970303 11 77 1569
1003 20 19970401 6.7 NA NA
I want to sum VOL and SHROUT by groups defined by PERMCO and date, but leaving the original number of rows unchanged, thus my desired output is the following:
PERMCO PERMNO date price VOL SHROUT VOL.sum SHROUT.sum
103 201 19951006 8.8 100 823 300 1825
103 203 19951006 7.9 200 1002 300 1825
1004 10 19951006 5 277 398 277 398
2 5 19951110 5.3 NA 579 NA 579
1003 2 19970303 10 67 NA 21 1569
1003 1 19970303 11 77 1569 21 1569
1003 20 19970401 6.7 NA NA NA NA
My data have more than 45 million observations and 8 columns. I have tried using ave:
crsp.daily$VOL.sum=ave(crsp.daily$VOL,c("PERMCO","date"),FUN=sum)
or sapply:
crsp.daily$VOL.sum=sapply(crsp.daily[,"VOL"],ave,crsp.daily$PERMCO,crsp.daily$date)
The problem is that it takes an extremely long time (more than 30 minutes and I still had not seen a result). Another thing I tried was to create a variable called "group" by pasting PERMCO and date like this:
crsp.daily$group=paste0(crsp.daily$PERMCO,crsp.daily$date)
and then apply ave using crsp.daily$group as the grouping variable. This also did not work because, from a certain observation on, R no longer distinguished the different values of crsp.daily$group and treated them as a single group.
The approach of creating the "group" variable did work on a smaller dataset.
Any advice is greatly appreciated!
With data.table you could use the following code:
require(data.table)
dt <- as.data.table(crsp.daily)
dt[, VOL.sum := sum(VOL), by = list(PERMCO, date)]
With the := operator you create a new variable (VOL.sum) containing the sum of VOL within each PERMCO and date group.
Output
PERMCO PERMNO date price VOL SHROUT VOL.sum
1 103 201 19951006 8.8 100 823 300
2 103 203 19951006 7.9 200 1002 300
3 1004 10 19951006 5.0 277 398 277
4 2 5 19951110 5.3 NA 579 NA
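To also fill SHROUT.sum in the same pass, a sketch using the same grouping (column names as in the question); note that sum() propagates NA within a group, so add na.rm = TRUE if you want NAs ignored instead:
library(data.table)
dt <- as.data.table(crsp.daily)
# compute both group sums at once; .SDcols restricts .SD to the two columns
dt[, c("VOL.sum", "SHROUT.sum") := lapply(.SD, sum),
   by = list(PERMCO, date), .SDcols = c("VOL", "SHROUT")]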

Adding a data frame below another data frame [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed 5 years ago.
I want to do the following:
I have an Actual Sales data frame:
Dates Actual
24/04/2017 58
25/04/2017 59
26/04/2017 58
27/04/2017 154
28/04/2017 117
29/04/2017 127
30/04/2017 178
And another data frame of Predicted values:
Dates Predicted
01/05/2017 68.54159
02/05/2017 90.7313
03/05/2017 82.76875
04/05/2017 117.48913
05/05/2017 110.3809
06/05/2017 156.53363
07/05/2017 198.14819
I want to add the Predicted sales data frame below the Actual data frame in the following manner:
Dates Actual Predicted
24/04/2017 58
25/04/2017 59
26/04/2017 58
27/04/2017 154
28/04/2017 117
29/04/2017 127
30/04/2017 178
01/05/2017 68.54159
02/05/2017 90.7313
03/05/2017 82.76875
04/05/2017 117.48913
05/05/2017 110.3809
06/05/2017 156.53363
07/05/2017 198.14819
With:
library(dplyr)
bind_rows(d1, d2)
you get:
Dates Actual Predicted
1 24/04/2017 58 NA
2 25/04/2017 59 NA
3 26/04/2017 58 NA
4 27/04/2017 154 NA
5 28/04/2017 117 NA
6 29/04/2017 127 NA
7 30/04/2017 178 NA
8 01/05/2017 NA 68.54159
9 02/05/2017 NA 90.73130
10 03/05/2017 NA 82.76875
11 04/05/2017 NA 117.48913
12 05/05/2017 NA 110.38090
13 06/05/2017 NA 156.53363
14 07/05/2017 NA 198.14819
Or with:
library(data.table)
rbindlist(list(d1,d2), fill = TRUE)
Or with:
library(plyr)
rbind.fill(d1,d2)
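In base R without any packages, rbind() requires both data frames to have the same set of columns, so a sketch (assuming the data frames are named d1 and d2 as above) would first add the missing columns filled with NA:
# columns are matched by name when rbind()-ing data frames
d1$Predicted <- NA
d2$Actual <- NA
combined <- rbind(d1, d2)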

How to process multi-column data in a data.frame with plyr

I am trying to process DSC (differential scanning calorimetry) data with R, but I have run into some trouble. All of this used to be done tediously in Origin or QtiPlot in my lab, and I wonder whether there is a way to do it in batch. So far it has not gone well. For example, maybe I have used the wrong column names for my data.frame; the code
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
cannot access my data.
Below is a full description of what I am trying to do. Thank you in advance!
The DSC data look like this (I have stored the CSV file in my Google Drive link):
T1 0.5min T2 1min
40.59 -0.2904 40.59 -0.2545
40.81 -0.281 40.81 -0.2455
41.04 -0.2747 41.04 -0.2389
41.29 -0.2728 41.29 -0.2361
41.54 -0.2553 41.54 -0.2239
41.8 -0.07 41.8 -0.0732
42.06 0.1687 42.06 0.1414
42.32 0.3194 42.32 0.2817
42.58 0.3814 42.58 0.3421
42.84 0.3863 42.84 0.3493
43.1 0.3665 43.11 0.3322
43.37 0.3438 43.37 0.3109
43.64 0.3265 43.64 0.2937
43.9 0.3151 43.9 0.2819
44.17 0.3072 44.17 0.2735
44.43 0.2995 44.43 0.2656
44.7 0.2899 44.7 0.2563
44.96 0.2779 44.96 0.245
In fact I have merged the data into a data.frame and hope I can adjust it and process it further.
The command is:
dat<-read.csv("Book1.csv",header=F)
colnames(dat)<-c('T1','0.5min','T2','1min','T3','2min','T4','4min','T5','8min','T6','10min',
'T7','20min','T8','ascast1','T9','ascast2','T10','ascast3','T11','ascast4',
'T12','ascast5'
)
So dat is a data.frame with 1163 obs. of 24 variables.
T1, T2, T3, ..., T12 are the temperatures at which the samples were measured in the DSC; although they cover the same interval, they differ slightly because of the instability of the machine.
The column next to each of T1~T12 is the heat flow recorded by the machine for a given heat-treatment duration, and ascast1~ascast5 are untreated samples used to check the accuracy of the machine.
Now I need to do something like the following:
T1~T12 are in degrees Celsius; I need to convert them to Kelvin, which means adding 273.16 to every value.
Two temperatures are chosen to compare the results: Ts = 180.25 and Te = 240.45 (both in degrees Celsius; I checked them in QtiPlot to make sure). To be clear, I list the two temperatures and the first few columns of data.
T1 0.5min T2 1min T3 2min T4 4min
180.25 -0.01710000 180.25 -0.01780000 180.25 -0.02120000 180.25 -0.02020000
. . . .
. . . .
240.45 0.05700000 240.45 0.04500000 240.45 0.05780000 240.45 0.05580000
All heat flow values at Ts should be the same, and for convenience can be set to 0. So for each of the heat-flow columns (0.5min, 1min, 2min, 4min, 8min, 10min, 20min and ascast1~ascast5), the heat flow value at Ts should be subtracted from every value in that column.
For the heat flow at Te, the values should then be adjusted so that all the heat flow data agree at Te. The idea is: (1) calculate the mean of the 12 heat flow values at Te; call it Hmean, the value every heat flow should take at Te. (2) For the data in column 0.5min, which I denote col("0.5min"), the linear transform formula is the following:
col("0.5min")-[([0.05700000-(-0.01710000)]-Hmean)/(Te-Ts)]*(col(T1)-Ts)
Actually, [0.05700000-(-0.01710000)] is done in step (2), but I write it out for reference. This formula is applied to each of the 12 (temperature, heat flow) column pairs: (T1, 0.5min), (T2, 1min), (T3, 2min), and so on.
Then the 12 pairs of data can be plotted on the same plot over the interval 180~240 (also in degrees Celsius) to magnify the differences between the DSC scans.
I have been stuck on this problem for 2 days, so I am turning to Stack Overflow for help.
Thanks!
I am assuming that your question is the one right at the beginning, where you got the following error,
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
as I could not find a question in the rest of the steps; they just seemed like a step-by-step procedure for an experiment.
To fix that error: the problem is that the column name starts with a number, so to reference the column the way you want you should wrap the name in backticks (`), the accent-mark symbol.
>dataF <- data.frame("0.5min"=1:10,"T2"=11:20,check.names = F)
> dataF$`0.5min`
[1] 1 2 3 4 5 6 7 8 9 10
Based on the comments adding more information:
You can add a constant to alternate columns in the following manner:
dataF <- data.frame(matrix(1:100,10,10))
const <- 237
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 11 21 31 41 51 61 71 81 91
2 2 12 22 32 42 52 62 72 82 92
3 3 13 23 33 43 53 63 73 83 93
4 4 14 24 34 44 54 64 74 84 94
5 5 15 25 35 45 55 65 75 85 95
6 6 16 26 36 46 56 66 76 86 96
7 7 17 27 37 47 57 67 77 87 97
8 8 18 28 38 48 58 68 78 88 98
9 9 19 29 39 49 59 69 79 89 99
10 10 20 30 40 50 60 70 80 90 100
dataF[,seq(1,ncol(dataF),by = 2)] <- dataF[,seq(1,ncol(dataF),by = 2)] + const
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 238 11 258 31 278 51 298 71 318 91
2 239 12 259 32 279 52 299 72 319 92
3 240 13 260 33 280 53 300 73 320 93
4 241 14 261 34 281 54 301 74 321 94
5 242 15 262 35 282 55 302 75 322 95
6 243 16 263 36 283 56 303 76 323 96
7 244 17 264 37 284 57 304 77 324 97
8 245 18 265 38 285 58 305 78 325 98
9 246 19 266 39 286 59 306 79 326 99
10 247 20 267 40 287 60 307 80 327 100
To generalize, we know that the columns of a dataframe can be referenced with a vector of numbers/column names. Most operations in R are vectorized. You can use column names or numbers based on the pattern you are looking for.
For example, if I change the names of my first two columns and want to access just those, I do this:
colnames(dataF)[c(1,2)] <- c("Y1","Y2")
#Reference all column names with "Y" in it. You can do any operation you want on this.
dataF[,grep("Y",colnames(dataF))]
Y1 Y2
1 238 11
2 239 12
3 240 13
4 241 14
5 242 15
6 243 16
7 244 17
8 245 18
9 246 19
10 247 20
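Applying the same idea to the first step in the question (converting the temperature columns from Celsius to Kelvin), a sketch that assumes dat with the column names set earlier (T1..T12 for the temperature columns):
# pick out the temperature columns by name pattern and add 273.16 to each
temp_cols <- grep("^T[0-9]+$", colnames(dat), value = TRUE)
dat[temp_cols] <- dat[temp_cols] + 273.16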

Globaltest Pathway analysis with a matrix

I have a matrix with SAGE count data and I want to test for GO enrichment and pathway enrichment. Therefore I want to use the globaltest package in R. My data look like this:
data_file
KI_1 KI_2 KI_4 KI_5 KI_6 WT_1 WT_2 WT_3 WT_4 WT_6
ENSMUSG00000002012 215 141 102 127 138 162 164 114 188 123
ENSMUSG00000028182 13 5 13 12 8 10 7 13 7 14
ENSMUSG00000002017 111 72 70 170 52 87 117 77 226 122
ENSMUSG00000028184 547 312 162 226 280 501 603 407 355 268
ENSMUSG00000002015 1712 1464 825 1038 1189 1991 1950 1457 1240 883
ENSMUSG00000028180 1129 944 766 869 737 1223 1254 865 871 844
The rownames contain Ensembl gene IDs and each column represents a sample. The samples can be divided into two groups for testing pathway enrichment: the KI1 group and the WT2 group.
groups <- c("KI1","KI1","KI1","KI1","KI1","WT2","WT2","WT2","WT2","WT2")
I found the function gtKEGG to do the pathway analysis, but my question is how to use it, because when I run the function I don't get any error, yet my output looks like this:
> gtKEGG(groups, t(data_file), annotation="org.Mm.eg.db")
holm alias p-value Statistic Expected Std.dev #Cov
00380 NA Tryptophan metabolism NA NA NA NA 0
01100 NA Metabolic pathways NA NA NA NA 0
02010 NA ABC transporters NA NA NA NA 0
04975 NA Fat digestion and absorption NA NA NA NA 0
04142 NA Lysosome NA NA NA NA 0
04012 NA ErbB signaling pathway NA NA NA NA 0
04110 NA Cell cycle NA NA NA NA 0
04360 NA Axon guidance NA NA NA NA 0
Can anyone help me with this question? Thanks! :)
I found the solution! It seems the pathways came back NA because gtKEGG matches genes by Entrez ID, while my row names are Ensembl IDs, so they need to be mapped via the probe2entrez argument:
library(globaltest)
library(org.Mm.eg.db)
# map the Ensembl gene IDs used as row names to Entrez IDs
eg <- as.list(org.Mm.egENSEMBL2EG)
KEGG <- gtKEGG(as.factor(groups), t(data_file), probe2entrez = eg, annotation = "org.Mm.eg.db")
