Reassigning one column according to another using data.table - r

I want to replace the value -11 in one column, "contra_end", with the corresponding value from another column, "current_age". -11 is a code indicating current activity, and I want to replace it with the actual age of each individual stored in "current_age". The age column has ~500,000 values, and only ~4,000 values in the first column are -11. When I run the following code to assign the age values to the -11 entries in "contra_end", I get the error below. Can I make this work without creating a new age variable?
biobank[contra_end == -11, contra_end := biobank[,"current_age", with=FALSE]]
Error in `[.data.table`(biobank, contra_end == -11, `:=`(contra_end, biobank[, :
Supplied 500000 items to be assigned to 4919 items of column 'contra_end'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.

I used a short dataset which I made using this code
biobank <- data.frame(contra_end = c(0,13,15,109,-11,23,45),
current_age = c(34,35,36,46,43,56,23))
which gives
contra_end current_age
1 0 34
2 13 35
3 15 36
4 109 46
5 -11 43
6 23 56
7 45 23
Using mutate from dplyr (part of the tidyverse)
library(dplyr)
biobank_2 <- biobank %>%
mutate(contra_end = ifelse(contra_end == -11, current_age, contra_end))
Or using base
biobank$contra_end[biobank$contra_end==-11] <- biobank$current_age[biobank$contra_end==-11]
Both options give:
contra_end current_age
1 0 34
2 13 35
3 15 36
4 109 46
5 43 43
6 23 56
7 45 23
EDIT: I didn't notice that you were looking for a data.table solution until after I posted. It doesn't sound like you have so many records that either of the solutions above would be too slow, though.
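Since the question asks for data.table specifically, here is a minimal sketch of the same replacement done by reference. It assumes biobank is a data.table (an existing data.frame can be converted with setDT()). Inside the brackets, current_age is evaluated only for the rows selected by the filter, so the right-hand side has the same length as the rows being assigned and the length error from the question should not arise.

```r
library(data.table)

# Same toy data as above, built directly as a data.table (an assumption for this sketch)
biobank <- data.table(contra_end  = c(0, 13, 15, 109, -11, 23, 45),
                      current_age = c(34, 35, 36, 46, 43, 56, 23))

# i selects the rows holding -11; j assigns by reference,
# taking current_age from those same rows only
biobank[contra_end == -11, contra_end := current_age]

print(biobank)
```

No new age variable is created; the column is updated in place.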

Related

How to extract rows with similar names into a submatrix?

I am building an asymmetrical matrix of values with the rows being coefficient names and the column the value of each coefficient:
# Set up Row and Column Names.
rows = c("Intercept", "actsBreaks0", "actsBreaks1","actsBreaks2","actsBreaks3","actsBreaks4","actsBreaks5","actsBreaks6",
"actsBreaks7","actsBreaks8","actsBreaks9","tBreaks0","tBreaks1","tBreaks2","tBreaks3", "unitBreaks0", "unitBreaks1",
"unitBreaks2","unitBreaks3", "covgBreaks0","covgBreaks1","covgBreaks2","covgBreaks3","covgBreaks4","covgBreaks5",
"covgBreaks6","yearBreaks2016","yearBreaks2015","yearBreaks2014","yearBreaks2013","yearBreaks2011",
"yearBreaks2010","yearBreaks2009","yearBreaks2008","yearBreaks2007","yearBreaks2006","yearBreaks2005",
"yearBreaks2004","yearBreaks2003","yearBreaks2002","yearBreaks2001","yearBreaks2000","yearBreaks1999",
"yearBreaks1998","plugBump0","plugBump1","plugBump2","plugBump3")
cols = c("Value")
# Build Matrix
matrix1 <- matrix(c(1:48), nrow = 48, ncol = 1, byrow = TRUE, dimnames = list(rows,cols))
output:
> matrix1
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
and I wish to extract certain rows that share row names (i.e. all rows with "unitBreaks'x'") into a submatrix.
I tried this
est_actsBreaks <- est_coef_mtrx[c("actsBreaks0","actsBreaks1","actsBreaks2","actsBreaks3",
"actsBreaks4","actsBreaks5","actsBreaks6","actsBreaks7",
"actsBreaks8","actsBreaks9"),c("Value")]
but it returns a vector and I need a matrix. I have seen other questions concerning similar procedures but their columns and rows all had identical names and/or values. Is there a way to do the operation I have in mind, such as grep()?
Welcome to StackOverflow.
As usual in R, there would probably be many ways to do what you request.
EDIT: I realized that my solution was going a little bit too far, sorry about that.
To extract only the rows whose names consist of "unitBreaks" followed by one or more digits, and still keep a matrix structure, you can run the following code. In a nutshell, grep looks for the pattern that you need, and the argument drop = FALSE makes sure that you get a matrix as a result and not a vector.
unitBreakLines <- grep("^unitBreaks[0-9]+$", rows)
matrix1[unitBreakLines, , drop = FALSE]
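To make the effect of drop = FALSE concrete, here is a small sketch on a four-row stand-in for matrix1. The names rows_demo and m_demo are made up for this illustration and are not part of the question.

```r
# Small stand-in for matrix1 (hypothetical names, for illustration only)
rows_demo <- c("Intercept", "unitBreaks0", "unitBreaks1", "tBreaks0")
m_demo <- matrix(1:4, ncol = 1, dimnames = list(rows_demo, "Value"))

# Indices of the rows whose names are "unitBreaks" plus digits
idx <- grep("^unitBreaks[0-9]+$", rows_demo)

as_vector <- m_demo[idx, ]                # dimensions dropped: a named vector
as_matrix <- m_demo[idx, , drop = FALSE]  # dimensions kept: a 2 x 1 matrix

is.matrix(as_vector)
dim(as_matrix)
```

The first subset loses its dimensions because a single column is selected; the second keeps the row and column names of the original matrix.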
Below is the first version of my answer.
First, I create a vector that describes the groups of rows. For this, I remove the numbers at the end of the row names.
grp <- gsub("[0-9]+$", "", rows)
Then, I transform the data matrix into a data-frame (why I do that is explained a little bit later).
dat1 <- data.frame(matrix1)
Finally, I use "split" on the data-frame, with the groups defined earlier. Using split on the data-frame will keep the structure: the result will be a list of data-frames, even though there is only one column.
dat1.split <- split(dat1, grp)
The result is a list of data-frames.
lapply(dat1.split, head)
$actsBreaks
Value
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
$covgBreaks
Value
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
$Intercept
Value
Intercept 1
$plugBump
Value
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
$tBreaks
Value
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
$unitBreaks
Value
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
$yearBreaks
Value
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
After that, if you still need matrices, you can convert them with the function as.matrix in an "lapply":
matrix1.split <- lapply(dat1.split, as.matrix)
You might want to consider combining your data in a "tibble" with the grouping column (the vector grp created above, stored here as simpler_rows). You will then be able to use these groups with the group_by function or other functions from the dplyr package (or other packages from the tidyverse).
For example:
library(dplyr)
tib1 <- tibble(rows, simpler_rows = grp, value = 1:48)
And an example on how to use the grouping variable:
tib1 %>%
group_by(simpler_rows) %>%
summarize(sum(value))
EDIT bis: what if I don't know the pattern?
I played around a little bit with your example to answer the question (that nobody asked, but still, it's fun!): "what if I don't know the pattern?"
In this case, I would use a distance between the row names. A heat map of this distance (figure not shown) is the output of the following lines of code:
library(stringdist)
library(pheatmap)
strdist <- stringdistmatrix(rows)
pheatmap(strdist, border_color = "white", cluster_rows = FALSE, cluster_cols = FALSE, cellwidth = 10, cellheight = 10, labels_row = rows, fontsize_row = 7)
After that, I only need to get the number of clusters, which can be done with a silhouette plot; it tells me that there are 8 clusters of words, which seems about right.
The clusters can then be extracted with the functions used to create the silhouette plot (I used hclust and cutree).
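The answer does not show the clustering code, so the following is only a sketch of what the hclust and cutree step might look like. Two assumptions: base R's adist (Levenshtein distance) is swapped in for stringdistmatrix to avoid an extra package, and hclust is left at its default linkage, since the answer does not say which one was used. The value k = 8 is taken from the text above.

```r
# Row names from the question, rebuilt compactly
rows <- c("Intercept",
          paste0("actsBreaks", 0:9),
          paste0("tBreaks", 0:3),
          paste0("unitBreaks", 0:3),
          paste0("covgBreaks", 0:6),
          paste0("yearBreaks", c(2016:2013, 2011:1998)),
          paste0("plugBump", 0:3))

# Pairwise edit distances between the names (adist is in utils; an assumption,
# the answer used stringdist::stringdistmatrix instead)
d <- as.dist(adist(rows))

# Hierarchical clustering with the default (complete) linkage
hc <- hclust(d)

# Cut the tree into the 8 clusters suggested by the silhouette plot
clusters <- cutree(hc, k = 8)

# List the row names that fall into each cluster
split(rows, clusters)
```

The resulting groups depend on the distance and linkage chosen, so they may not coincide exactly with the name prefixes.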
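Here is a solution with dplyr and stringr that filters rows according to whether their row names contain a certain string. Note that the ! below drops the matching rows (as in the output shown); remove it to keep only the "unitBreaks" rows.
At the end, change back to a matrix:
library(dplyr)
library(stringr)
# df is the matrix from the question converted to a data frame
df <- as.data.frame(matrix1)
df1 <- df %>%
filter(!str_detect(rownames(df), "unitBreaks"))
df1 <- as.matrix(df1)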
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48

R countif and sum on multiple columns matching elements in specified vector

I am applying this function to my dataset column DL1 on another vector as below and receiving the results expected
table(df$DL1[df$DL1 %in% undefined_dl_codes])
Result:
0 10 30 3B 4 49 54 5A 60 7 78 8 90
24 366 4 3 665 40 1 1 14 8 4 87 1
However, I also have columns DL2, DL3 and DL4, which hold the same kind of data. How can I apply the function to multiple columns and get one combined result? I need to go through all 4 columns and receive a single summary.
Any help highly appreciated!
It may not be the best method, but you could do the following:
table(c(df$DL1[df$DL1 %in% undefined_dl_codes],
df$DL2[df$DL2 %in% undefined_dl_codes],
df$DL3[df$DL3 %in% undefined_dl_codes],
df$DL4[df$DL4 %in% undefined_dl_codes]
)
)
Using Raghuveer's solution, I simplified further:
attach(df)
table(c(DL1,DL2,DL3,DL4)[c(DL1,DL2,DL3,DL4) %in% undefined_dl_codes])
detach(df)
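A variant that avoids attach() is to flatten the four columns with unlist() and filter once. This is only a sketch on made-up data: the contents of df and undefined_dl_codes below are invented for illustration, and the columns are assumed to be character vectors rather than factors.

```r
# Mock data (hypothetical values; the real df and code list are not shown in the question)
df <- data.frame(DL1 = c("0", "10", "4", "99"),
                 DL2 = c("4", "4", "77", "8"),
                 DL3 = c("10", "55", "0", "4"),
                 DL4 = c("8", "99", "99", "10"),
                 stringsAsFactors = FALSE)
undefined_dl_codes <- c("0", "10", "4", "8")

# Stack the four columns into one vector, keep the codes of interest, count them
cols <- c("DL1", "DL2", "DL3", "DL4")
all_codes <- unlist(df[cols], use.names = FALSE)
counts <- table(all_codes[all_codes %in% undefined_dl_codes])

print(counts)
```

Because the column names live in a single vector, adding a DL5 later only means extending cols.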

Speedup search of Elements

I have two data.frames: m (23 columns, 135,973 rows) with two important columns
head(m[,2])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(m[,7])
# [1] 3661216 3661217 3661223 3661224 3661564 3661567
and search (4 columns, 1,019,423 rows) with three important columns
head(search[,1])
# [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1"
head(search[,3])
# [1] 3000009 3003160 3003187 3007262 3028947 3050944
head(search[,4])
# [1] 3000031 3003182 3003209 3007287 3028970 3050995
For each row in m, I would like to know whether the position m[XX,7] lies between search[,3] and search[,4] for any row of search. So search[,3] can be considered the "start" and search[,4] the "end". In addition, search[,1] and m[,2] have to be identical.
Example:
m at row 215
"chr1" 10,984,038
hits in search at line 2898
"chr1" 10,984,024 10,984,046
In general, I'm not interested in which line or how many lines of search match. For each line of m, I just want to know whether there is a matching line in search, yes or no.
I ended up with this function:
f_4<-function(x,y,z){
for (out in 1:length(x[,1])) {
z[out]<-length(which((y[,1]==x[out,2]) &(x[out,7]>=y[,3]) &(x[out,7]<=y[,4])))
}
return(z)
}
found4<-vector(length=length(m[,1]), mode="numeric")
found4<-f_4(m,search,found4)
It took 3 hours to run this code.
I have already tried some speedup approaches, but I didn't manage to get any of them running properly or faster.
I even tried some lapply/apply approaches, which worked but weren't faster. However, they failed when I tried to speed them up using parLapply/parRapply.
Does anybody have a much faster approach or some advice?
EDIT 2015/09/18
Found another way to speed up, using foreach %dopar%.
f5<-function(x,y,z){
foreach(out=1:length(x[,1]), .combine="c") %dopar% {
takt<-1000
z=length(which((y[,1]==x[out,2]) &(x[out,7]>=y[,3]) &(x[out,7]<=y[,4]) ))
}
return(z)
}
found5<-vector(length=length(m[,1]), mode="numeric")
found5<-f5(m,search,found5)
This only needs 45 min. However, I always get 0. I think I need to read some more of the foreach %dopar% tutorials.
You can try merging with subsequent logical subsetting. First let's create some mock data:
set.seed(123) # used for reproducibility
m <- as.data.frame(matrix(sample(50, 7000, replace = TRUE), ncol = 7, nrow = 1000))
search <- as.data.frame(matrix(sample(50, 1200, replace = TRUE), ncol = 4, nrow = 300))
Since we want to compare different rows of the two sets, we can use the criterion that m[,2] should be equal to search[,1]. For convenience we can name these columns "ID" in both sets:
m <- cbind(m, seq_len(nrow(m)))
search <- cbind(search, seq_len(nrow(search)))
colnames(m) <- c("a","ID","c","d","e","f","val","rownum.m")
colnames(search) <- c("ID","nothing","start","end", "rownum.s")
We have added a column to m named 'rownum.m' and a similar column to search which in the end will help identifying the resulting entries in the initial dataset.
Now we can merge the data sets, such that the ID is the same:
m2 <- merge(m,search)
In a final step, we can perform a logical subset of the merged data set and assign the output to a new data frame m3:
m3 <- m2[(m2[,"val"] >= m2[,"start"]) & (m2[,"val"] <= m2[,"end"]),]
#> head(m3)
# ID a c d e f val rownum.m nothing start end rownum.s
#5 1 14 36 36 31 30 25 846 10 20 36 291
#13 1 34 49 24 8 44 21 526 10 20 36 291
#17 1 19 32 29 44 24 35 522 6 33 48 265
#20 1 19 32 29 44 24 35 522 32 31 50 51
#21 1 19 32 29 44 24 35 522 10 20 36 291
#29 1 6 50 10 13 43 22 15 10 20 36 291
If we are only interested in a TRUE/FALSE statement on whether a specific row of m matches the criteria, we can define a vector match_s:
match_s <- m$rownum.m %in% m3$rownum.m
which can be stored as an additional column in the original data set m:
m <- cbind(m,match_s)
Finally, we can remove the auxiliary column 'rownum.m' from the data set m which is no longer needed, with m <- m[,-8].
The result is:
> head(m)
# a ID c d e f val match_s
#1 15 14 8 11 16 13 23 FALSE
#2 40 30 8 48 42 50 20 FALSE
#3 21 9 8 19 30 36 19 TRUE
#4 45 43 26 32 41 33 27 FALSE
#5 48 43 25 10 15 13 4 FALSE
#6 3 24 31 33 8 5 36 FALSE
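With tables of the size described in the question, merge() pairs every row of m with every row of search that shares the same ID before filtering, so the intermediate table can become very large. One possible alternative, sketched here under the assumption that the data.table package is acceptable, is a non-equi update join that flags matches without materialising all pairs. The column names chr, pos, start and end are made up for this sketch, and the toy rows reuse the example positions from the question.

```r
library(data.table)

# Toy versions of m and search (hypothetical column names)
m_dt <- data.table(chr = c("chr1", "chr1", "chr2"),
                   pos = c(10984038, 3661216, 10984038))
s_dt <- data.table(chr   = c("chr1", "chr1"),
                   start = c(10984024, 3000009),
                   end   = c(10984046, 3000031))

# Start with "no match" everywhere
m_dt[, match_s := FALSE]

# Non-equi update join: same chromosome and start <= pos <= end.
# Rows of m_dt matched by at least one row of s_dt are set to TRUE by reference.
m_dt[s_dt, on = .(chr, pos >= start, pos <= end), match_s := TRUE]

print(m_dt)
```

Only the first row lies inside an interval on the same chromosome, so it is the only one flagged.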
If you're trying to find SNPs (say) inside a set of genomic regions, don't use R. Use BEDOPS.
Convert your SNP or single-base positions to a three-column BED file. In R, make a three-column data table with m[,2], m[,7] and m[,7] + 1, which represent the chromosome, start and stop position of the SNP. Use write.table() to write out this data table to a tab-delimited text file.
Do the same with your genomic regions: Write search[,1], search[,3], and search[,4] to a three-column data table representing the chromosome, start and stop position of the region. Use write.table() to write this out to a tab-delimited text file.
Use sort-bed to sort both BED files. This step might be optional, but it doesn't take long to do and it guarantees that the files are prepped for use with BEDOPS tools.
Finally, use bedmap on the two BED files to map SNPs to genomic regions. Mapping associates SNPs with regions. The bedmap tool can report which SNPs map to a region, or report the number of SNPs, or one or more of many other operations. The documentation for bedmap goes into more detail on the list of operations, but the provided example should get you started quickly.
If your data are in BED format, or can be quickly coerced into BED format, don't use R for genomic operations, as it is slow and memory-intensive. The BEDOPS toolkit introduced the use of sorting to make genomic operations fast, with low memory overhead.
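The export step described above could look roughly like this in R. It is a sketch on two mock positions: the file is written to a temporary path chosen for illustration, and the start and end columns follow the answer's suggestion (position and position + 1), which may need adjusting depending on whether your positions are 0-based or 1-based. Converting to integer is a precaution so that large positions are not written in scientific notation.

```r
# Mock stand-ins for m[,2] and m[,7] (values taken from the question's examples)
snp_chr <- c("chr1", "chr1")
snp_pos <- c(3661216, 10984038)

# Three-column table: chromosome, start, stop
snps <- data.frame(chr   = snp_chr,
                   start = as.integer(snp_pos),
                   end   = as.integer(snp_pos + 1))

# Tab-delimited, no header, no row names, no quotes
bed_file <- tempfile(fileext = ".bed")  # hypothetical output location
write.table(snps, bed_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

readLines(bed_file)
```

The regions file would be written the same way from search[,1], search[,3] and search[,4].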

Error in the output file of a for loops in r

I'm trying to resample a list using for loops in R, generating a data frame that records the output of each trial.
The for loop runs without error, but I am sure I am making a mistake somewhere, as I should not be getting the result that I get for the jth entry among the possible outcomes.
Here's how I am generating my list:
set1=rep(0,237) # repeat 0's 237 times
set2=rep(1,33) # repeats 1s 33 times
aa=c(set1,set2) # put the two lists together
table(aa) # just a test count to make sure I have it set up right
Now I want to take a random sample set of size j out of aa and record how many 0's and 1's I get each time I perform this task (let's say n number of trials).
Here's how I have set it up:
n=1000
j=27
output=matrix(0,nrow=2,ncol=n)
for (i in 1:n){
trial<-sample(aa,j,replace=F)
counts=table(trial)
output[,i]=counts
}
Checking the output,
table(output[1,])
# 17 18 19 20 21 22 23 24 25 26 27
1 1 9 17 46 135 214 237 205 111 24
table(output[2,])
# 1 2 3 4 5 6 7 8 9 10 27
111 205 237 214 135 46 17 9 1 1 24
I do not think I am getting the right answer from the distribution for the jth value (in this case 27) for either the number of 0's or the number of 1's (it should be close to 0, as opposed to the high number it returns).
Any suggestions as to where I am going wrong would be greatly appreciated.
If you have only 0s in trial, then length(counts) == 1 and the value gets recycled when you assign it to output. Try this:
for (i in 1:n){
trial<-sample(aa,j,replace=F)
trial <- factor(trial, levels=0:1)
counts=table(trial)
output[,i]=counts
}
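The recycling can be reproduced without any randomness. The sketch below uses a hand-made trial of 27 zeros (the variable names are invented for this illustration) and compares the plain table with the factor version.

```r
# A trial that happens to contain only 0s
all_zero <- rep(0, 27)

# Without fixed levels, table() returns a single count
counts_plain <- table(all_zero)
length(counts_plain)

# Assigning that single value to a two-row column recycles it into both rows
output <- matrix(0, nrow = 2, ncol = 1)
output[, 1] <- counts_plain
out_plain <- output[, 1]
out_plain

# With both levels declared, the count for 1 is an explicit 0
counts_fixed <- table(factor(all_zero, levels = 0:1))
output[, 1] <- counts_fixed
out_fixed <- output[, 1]
out_fixed
```

This is why the value 27 shows up in both rows of the original output for the trials that drew no 1s.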
Of course, you could more efficiently use rhyper:
table(rhyper(1000, table(aa)[1], table(aa)[2], 27))
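For context, rhyper(nn, m, n, k) draws nn values of the number of "white" items obtained when k items are taken without replacement from m white and n black ones; in the call above the 237 zeros play the role of the white items. The sketch below (variable names are mine) computes the expected number of zeros per trial and the probability of a trial with no 1s at all, which by a rough estimate is on the order of a few percent rather than essentially zero.

```r
n_zero <- 237   # number of 0s in aa
n_one  <- 33    # number of 1s in aa
j      <- 27    # sample size per trial

# Mean of the hypergeometric distribution: k * m / (m + n)
expected_zeros <- j * n_zero / (n_zero + n_one)
expected_zeros

# Probability that all j sampled values are 0
p_all_zero <- dhyper(j, n_zero, n_one, j)
p_all_zero

# Simulated number of 0s in 1000 trials, as in the answer
set.seed(1)
sim <- rhyper(1000, n_zero, n_one, j)
table(sim)
```

So a modest number of trials with 27 zeros is plausible; the count of 1s in those trials should be 0.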

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
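Put together on the sample data from the question, the reordering can be written in one step. This sketch uses setdiff(), which is equivalent to the nms[nms != 'total'] filter above when column names are unique.

```r
# Sample data from the question
x <- 1:10
y <- 21:30
z <- data.frame(x, y)
z$total <- z$x + z$y
z$w <- 11:20
z$total <- z$x + z$y + z$w

# All columns except 'total', followed by 'total'
z <- z[, c(setdiff(names(z), "total"), "total")]

names(z)
```

The same line can be rerun after each new column is added.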
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)
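A minimal sketch of the "append the total last" idea, assuming every column in the data frame is numeric and should be part of the sum:

```r
# Build the data frame without a total column
z <- data.frame(x = 1:10, y = 21:30)
z$w <- 11:20

# Compute the total only once all other columns are in place
# (assumes all existing columns are numeric and belong in the sum)
total <- unname(rowSums(z))

# Appending now makes 'total' the rightmost column
z$total <- total

names(z)
```

If more columns arrive later, drop total, add the columns, and recompute it the same way.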
