Find rows with a certain combination of values in R

I have a data frame that looks like this:
iso_o iso_d FLOW FLOW_0
185 190 NA NA
185 190 NA NA
185 190 NA NA
185 190 1 NA
185 190 NA NA
185 190 NA 4249
185 114 1 NA
Now I want to know which rows, and how many rows, have for example "185" in iso_o and "190" in iso_d.
Can anyone point me in the right direction?

We can try subset:
> subset(df, iso_o == 185 & iso_d == 190)
iso_o iso_d FLOW FLOW_0
1 185 190 NA NA
2 185 190 NA NA
3 185 190 NA NA
4 185 190 1 NA
5 185 190 NA NA
6 185 190 NA 4249
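If you only need the count, wrapping the same call in nrow gives it directly:
> nrow(subset(df, iso_o == 185 & iso_d == 190))
[1] 6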

You can find the indices with the which() function:
which(data$iso_o == 185 & data$iso_d == 190)
Adding parentheses might make it a bit easier to read:
which( (data$iso_o == 185) & (data$iso_d == 190) )
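Since the question also asks for the number of such rows, length() on that index (or sum() on the logical vector, which counts the TRUEs) gives the count:
idx <- which(data$iso_o == 185 & data$iso_d == 190)
length(idx)
# [1] 6
sum(data$iso_o == 185 & data$iso_d == 190)
# [1] 6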

Related

combine two similar columns in r

I'm trying to combine two columns of data that essentially contain the same information, but each column is missing some values that the other has. Column "wasiIQw1" holds the data for half of the group, while column w1iq holds the data for the other half of the group.
select(gadd.us,nidaid,wasiIQw1,w1iq)[1:10,]
nidaid wasiIQw1 w1iq
1 45-D11150341 104 NA
2 45-D11180321 82 NA
3 45-D11220022 93 93
4 45-D11240432 118 NA
5 45-D11270422 99 NA
6 45-D11290422 82 82
7 45-D11320321 99 99
8 45-D11500021 99 99
9 45-D11500311 95 95
10 45-D11520011 111 111
select(gadd.us,nidaid,wasiIQw1,w1iq)[384:394,]
nidaid wasiIQw1 w1iq
384 H1900442S NA 62
385 H1930422S NA 83
386 H1960012S NA 89
387 H1960321S NA 90
388 H2020011S NA 96
389 H2020422S NA 102
390 H2040011S NA 102
391 H2040331S NA 94
392 H2040422S NA 103
393 H2050051S NA 86
394 H2050341S NA 98
With the following code I joined df.a (a df with the id and wasiIQw1) with df.b (a df with the id and w1iq) and got the following results:
df.join <- semi_join(df.a,
                     df.b,
                     by = "nidaid")
nidaid w1iq
1 45-D11150341 NA
2 45-D11180321 NA
3 45-D11220022 93
4 45-D11240432 NA
5 45-D11270422 NA
6 45-D11290422 82
7 45-D11320321 99
8 45-D11500021 99
9 45-D11500311 95
10 45-D11520011 111
nidaid w1iq
384 H1900442S 62
385 H1930422S 83
386 H1960012S 89
387 H1960321S 90
388 H2020011S 96
389 H2020422S 102
390 H2040011S 102
391 H2040331S 94
392 H2040422S 103
393 H2050051S 86
394 H2050341S 98
All of this works except for the first four "NA"s, which won't merge. Other "_join" functions from dplyr have not worked either. Do you have any tips for combining these two columns so that no data is lost and all "NA"s are filled in when the other column has a value present?
I guess you can use coalesce here, which finds the first non-missing value at each position.
library(dplyr)
gadd.us %>% mutate(w1iq = coalesce(w1iq, wasiIQw1))
This will select values from w1iq if present or if w1iq is NA then it would select value from wasiIQw1. You can switch the position of w1iq and wasiIQw1 if you want to give priority to wasiIQw1.
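A minimal sketch with made-up values (df here is a toy frame, not the OP's data), to show what coalesce does:
library(dplyr)
df <- data.frame(wasiIQw1 = c(104, NA, 93), w1iq = c(NA, 83, 93))
df %>% mutate(w1iq = coalesce(w1iq, wasiIQw1))
#   wasiIQw1 w1iq
# 1      104  104
# 2       NA   83
# 3       93   93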
Here is a way to do it with base R (no packages).
Create reproducible data:
> dat<-data.frame(nidaid=paste0("H",c(1:5)), wasiIQw1=c(NA,NA,NA,75,9), w1iq=c(44,21,46,75,NA))
>
> dat
nidaid wasiIQw1 w1iq
1 H1 NA 44
2 H2 NA 21
3 H3 NA 46
4 H4 75 75
5 H5 9 NA
Create a new column named new to combine the two. With this ifelse statement we say: if the first column wasiIQw1 is not (!) NA (is.na()), grab it; otherwise grab the second column. Similar to Ronak's answer, you can switch the column names here to give one preference over the other.
> dat$new<-ifelse(!is.na(dat$wasiIQw1), dat$wasiIQw1, dat$w1iq)
>
> dat
nidaid wasiIQw1 w1iq new
1 H1 NA 44 44
2 H2 NA 21 21
3 H3 NA 46 46
4 H4 75 75 75
5 H5 9 NA 9
Using base R, we can do
gadd.us$w1iq <- with(gadd.us, pmax(w1iq, wasiIQw1, na.rm = TRUE))
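One caveat: pmax() takes the element-wise maximum, so it matches coalesce() only when overlapping non-NA values agree (as they do in this data). A quick check on the reproducible dat from above:
with(dat, pmax(w1iq, wasiIQw1, na.rm = TRUE))
# [1] 44 21 46 75  9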

Efficient way to add multiple columns to weekly data in data.table, based on values of other columns

I have data with this structure:
a <- data.table(week = 1:52, price = 101:152)
a <- a[rep(1:nrow(a), each = 12),]
a$index_in_week <- 1:12
How do I efficiently create 12 new columns that hold the prices for the next 12 weeks? For each week we have 12 rows of data, with an index column that always runs from 1 to 12. The new columns should contain the prices of the following 12 weeks starting from the current one, in steps of 1 week. For example, for week 1 the first new column will have the prices of weeks 1 to 12, column 2 will have the values of weeks 2 to 13, and so on.
I.e., here is how one can create the first two columns:
a$price_for_week_1 <- apply(a, 1, function(y) {
  return(head(a[week == (y[[1]] + y[[3]] - 1), price], 1))
})
a$price_for_week_2 <- apply(a, 1, function(y) {
  return(head(a[week == (y[[1]] + y[[3]] + 0), price], 1))
})
Here is an example of a for loop:
for (i in 1:12) {
  inside_i <- -2 + i
  a[, paste0('PRICE_WEEK_', i) := apply(a, 1, function(y) {
    return(head(a[week == (y[[1]] + y[[3]] + inside_i), price], 1))
  })]
}
The approaches I see (e.g. a for loop or the apply family) consume too much time, and I need efficiency.
What would be the way with data.table or maybe, as all columns are integer, some funky matrix operations?
P.S. I couldn't come up with a better title, my apologies.
If I understand correctly, the OP wants to create a table for 52 weeks (rows) where the prices for the subsequent 12 weeks are printed horizontally.
For this, it is not necessary to create a data.table of 12 x 52 = 624 rows and an index_in_week helper column. docendo discimus suggested applying the shift() function to the enlarged (624-row) data.table.
Instead, the shift() function can be applied directly to the data.table which contains weeks and prices (52 rows).
library(data.table)
a <- data.table(week = 1:52, price = 101:152)
print(a, nrows = 20L)
week price
1: 1 101
2: 2 102
3: 3 103
4: 4 104
5: 5 105
---
48: 48 148
49: 49 149
50: 50 150
51: 51 151
52: 52 152
a[, sprintf("wk%02i", 1:12) := shift(price, n = 0:11, type = "lead")]
print(a, nrows = 20L)
week price wk01 wk02 wk03 wk04 wk05 wk06 wk07 wk08 wk09 wk10 wk11 wk12
1: 1 101 101 102 103 104 105 106 107 108 109 110 111 112
2: 2 102 102 103 104 105 106 107 108 109 110 111 112 113
3: 3 103 103 104 105 106 107 108 109 110 111 112 113 114
4: 4 104 104 105 106 107 108 109 110 111 112 113 114 115
5: 5 105 105 106 107 108 109 110 111 112 113 114 115 116
---
48: 48 148 148 149 150 151 152 NA NA NA NA NA NA NA
49: 49 149 149 150 151 152 NA NA NA NA NA NA NA NA
50: 50 150 150 151 152 NA NA NA NA NA NA NA NA NA
51: 51 151 151 152 NA NA NA NA NA NA NA NA NA NA
52: 52 152 152 NA NA NA NA NA NA NA NA NA NA NA
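Since the OP also asked about matrix operations, here is a sketch of a base R alternative (not benchmarked, so treat it as an idea rather than a proven speed-up): embed() extracts all 12-wide windows of the NA-padded price vector; its windows run backwards in time, so the columns are reversed before binding them on.
b <- data.table(week = 1:52, price = 101:152)
# all 12-wide windows of the padded vector, flipped to read forward in time
m <- embed(c(b$price, rep(NA_integer_, 11)), 12)[, 12:1]
b[, sprintf("wk%02i", 1:12) := as.data.table(m)]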

How to count rows in a logical vector

I have a data frame called source that looks something like this:
185 2002-07-04 NA NA 20
186 2002-07-05 NA NA 20
187 2002-07-06 NA NA 20
188 2002-07-07 14.400 0.243 20
189 2002-07-08 NA NA 20
190 2002-07-09 NA NA 20
191 2002-07-10 NA NA 20
192 2002-07-11 NA NA 20
193 2002-07-12 NA NA 20
194 2002-07-13 4.550 0.296 20
195 2002-07-14 NA NA 20
196 2002-07-15 NA NA 20
197 2002-07-16 NA NA 20
198 2002-07-17 NA NA 20
199 2002-07-18 NA NA 20
200 2002-07-19 NA 0.237 20
and when I try
> nrow(complete.cases(source))
I only get NULL
Can someone explain why this is the case, and how I can count how many rows there are without NA or NaN values?
Instead use sum. Though the safest option would be NROW (because it can handle both data.frames and vectors):
sum(complete.cases(source))
#[1] 2
Or alternatively, if you insist on using nrow:
nrow(source[complete.cases(source), ])
#[1] 2
Explanation: complete.cases returns a logical vector indicating which cases (in your case rows) are complete.
Sample data
source <- read.table(text =
"185 2002-07-04 NA NA 20
186 2002-07-05 NA NA 20
187 2002-07-06 NA NA 20
188 2002-07-07 14.400 0.243 20
189 2002-07-08 NA NA 20
190 2002-07-09 NA NA 20
191 2002-07-10 NA NA 20
192 2002-07-11 NA NA 20
193 2002-07-12 NA NA 20
194 2002-07-13 4.550 0.296 20
195 2002-07-14 NA NA 20
196 2002-07-15 NA NA 20
197 2002-07-16 NA NA 20
198 2002-07-17 NA NA 20
199 2002-07-18 NA NA 20
200 2002-07-19 NA 0.237 20")
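For reference, with this sample data the logical vector has exactly two TRUE values, which is what sum is counting:
which(complete.cases(source))
#[1]  4 10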
complete.cases returns a logical vector that indicates which rows are complete. As a vector doesn't have a row attribute, you cannot use nrow here; as suggested by others, use sum. With sum, TRUE and FALSE are converted to 1 and 0 internally, so sum counts the TRUE values of your vector.
sum(complete.cases(source))
# [1] 2
If, however, you are more interested in the data.frame that remains after excluding all incomplete rows, you can use na.exclude. This returns a data.frame, and you can use nrow on it.
nrow(na.exclude(source))
# [1] 2
na.exclude(source)
# V2 V3 V4 V5
# 188 2002-07-07 14.40 0.243 20
# 194 2002-07-13 4.55 0.296 20
You can even try:
source[rowSums(is.na(source))==0,]
# V1 V2 V3 V4 V5
# 4 188 2002-07-07 14.40 0.243 20
# 10 194 2002-07-13 4.55 0.296 20
nrow(source[rowSums(is.na(source))==0,])
#[1] 2
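A side benefit of the rowSums approach is that the threshold is adjustable; for instance, to keep rows with at most one NA instead of none:
source[rowSums(is.na(source)) <= 1, ]
# V1         V2    V3    V4 V5
# 4  188 2002-07-07 14.40 0.243 20
# 10 194 2002-07-13  4.55 0.296 20
# 16 200 2002-07-19    NA 0.237 20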

Apply function to dataframe based on unique values

I need to apply a function to a dataframe, subsetted or grouped by unique values.
My data looks like this:
FID FIX_NO ELK_ID ALTITUDE XLOC YLOC DATE_TIME JulDate
1 NA 5296 393 2260.785 547561.3 4771900 NA 140
2 NA 5297 393 2254.992 547555.9 4771906 NA 140
3 NA 5298 393 2256.078 547563.5 4771901 NA 140
4 NA 5299 393 2247.047 547564.7 4771907 NA 140
5 NA 5300 393 2264.875 547558.3 4771903 NA 140
6 NA 5301 393 2259.496 547554.1 4771925 NA 140
...
24247 NA 4389 527 2204.047 558465.7 4775358 NA 161
24248 NA 4390 527 2279.078 558884.1 4775713 NA 161
24249 NA 4391 527 2270.590 558807.9 4775825 NA 161
24250 NA 4392 527 2265.258 558732.2 4775805 NA 161
24251 NA 4393 527 2238.375 558672.4 4775781 NA 161
24252 NA 4394 527 2250.055 558686.6 4775775 NA 161
My goal is to make a new data.frame by randomly selecting 4 rows per JulDate for each unique ELK_ID.
If I do it by hand, for each unique ELK_ID my code is as follows:
oneelk <- subset(dataset, ELK_ID == 393)
newdata <- do.call(rbind, lapply(split(oneelk, oneelk$JulDate),
                                 function(x) x[sample(1:nrow(x), 4), ]))
There are >40 ELK_IDs, so I need to automate the process. Please help!
Here is a data.table solution.
library(data.table)
setDT(dataset)[,.SD[sample(.N,4)],by=list(ELK_ID,JulDate)]
# ELK_ID JulDate FID FIX_NO ALTITUDE XLOC YLOC DATE_TIME
# 1: 393 140 NA 5297 2254.992 547555.9 4771906 NA
# 2: 393 140 NA 5299 2247.047 547564.7 4771907 NA
# 3: 393 140 NA 5298 2256.078 547563.5 4771901 NA
# 4: 393 140 NA 5300 2264.875 547558.3 4771903 NA
# 5: 527 161 NA 4394 2250.055 558686.6 4775775 NA
# 6: 527 161 NA 4392 2265.258 558732.2 4775805 NA
# 7: 527 161 NA 4390 2279.078 558884.1 4775713 NA
# 8: 527 161 NA 4393 2238.375 558672.4 4775781 NA
NB, this will only work if there are at least 4 rows for every combination of ELK_ID and JulDate.
You can also create an index using tapply and then just subset (assuming your data set is called df):
indx <- unlist(tapply(seq_len(dim(df)[1L]),
                      df[, c("JulDate", "ELK_ID")],
                      function(x) sample(x, 4)))
df[indx, ]
Try to split using both columns, maybe split(dataset, dataset[, c("ELK_ID", "JulDate")])
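Fleshing that suggestion out along the lines of the OP's own code (a sketch; drop = TRUE skips ELK_ID/JulDate combinations that never occur together):
grps <- split(dataset, dataset[, c("ELK_ID", "JulDate")], drop = TRUE)
newdata <- do.call(rbind, lapply(grps, function(x) x[sample(nrow(x), 4), ]))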
Might as well add a dplyr solution too:
library(dplyr)
newdf <- yourdata %>%
  group_by(ELK_ID, JulDate) %>%
  sample_n(4)
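Note that in current dplyr sample_n() is superseded by slice_sample(), so the same idea can also be written as:
newdf <- yourdata %>%
  group_by(ELK_ID, JulDate) %>%
  slice_sample(n = 4) %>%
  ungroup()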

Globaltest Pathway analysis with a matrix

I have a matrix with SAGE count data and I want to test for GO enrichment and pathway enrichment. Therefore I want to use the globaltest in R. My data looks like this:
data_file
KI_1 KI_2 KI_4 KI_5 KI_6 WT_1 WT_2 WT_3 WT_4 WT_6
ENSMUSG00000002012 215 141 102 127 138 162 164 114 188 123
ENSMUSG00000028182 13 5 13 12 8 10 7 13 7 14
ENSMUSG00000002017 111 72 70 170 52 87 117 77 226 122
ENSMUSG00000028184 547 312 162 226 280 501 603 407 355 268
ENSMUSG00000002015 1712 1464 825 1038 1189 1991 1950 1457 1240 883
ENSMUSG00000028180 1129 944 766 869 737 1223 1254 865 871 844
The row names contain Ensembl gene IDs and each column represents a sample. These samples can be divided into two groups for testing pathway enrichment, the KI1 group and the WT2 group:
groups <- c("KI1","KI1","KI1","KI1","KI1","WT2","WT2","WT2","WT2","WT2")
I found the function gtKEGG to do the pathway analysis, but my question is how to use it, because when I run the function I don't get any error, yet my output looks like this:
> gtKEGG(groups, t(data_file), annotation="org.Mm.eg.db")
holm alias p-value Statistic Expected Std.dev #Cov
00380 NA Tryptophan metabolism NA NA NA NA 0
01100 NA Metabolic pathways NA NA NA NA 0
02010 NA ABC transporters NA NA NA NA 0
04975 NA Fat digestion and absorption NA NA NA NA 0
04142 NA Lysosome NA NA NA NA 0
04012 NA ErbB signaling pathway NA NA NA NA 0
04110 NA Cell cycle NA NA NA NA 0
04360 NA Axon guidance NA NA NA NA 0
Can anyone help me with this question? Thanks! :)
I found the solution! The row names are Ensembl gene IDs while gtKEGG expects Entrez IDs, so the Ensembl-to-Entrez mapping has to be passed via the probe2entrez argument:
library(globaltest)
library(org.Mm.eg.db)
eg <- as.list(org.Mm.egENSEMBL2EG)
KEGG <- gtKEGG(as.factor(groups), t(data_file), probe2entrez = eg, annotation = "org.Mm.eg.db")
