Why can't I see my full rbindlist result? - r

I used the rbindlist() function to try and merge two melted data frames (means_melt and means_melt_50). I'm wondering why it comes up with the break in the data? And whether I can use the whole list as I ultimately intend to create two graphs, each with 5 sets of data (grouped by variable), and using facet_grid(). I want the two graphs separated based on "Accuracy".
> compiled_means <- list(means_melt, means_melt_50)
> rbindlist(compiled_means, use.names = TRUE, fill=FALSE, idcol = NULL)
Divisions Accuracy variable value
1: 1 0 mean20 16
2: 2 0 mean20 20
3: 3 0 mean20 21
4: 4 0 mean20 17
5: 5 0 mean20 20
---
196: 16 50 mean_2 2
197: 17 50 mean_2 2
198: 18 50 mean_2 2
199: 19 50 mean_2 4
200: 20 50 mean_2 3
If anyone has a more efficient way for me to format the data so that it can be put in the graphs I want, I'm happy to hear suggestions. I'm not sure if the route I'm taking if effective or long-winded...

Simply a matter of preferences and options - by default the function shows a summary of data.tables that have >100 rows. The following direct print gives the full data table.
print(your.data.table, nrows = Inf)
https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-faq.html#only-the-first-10-rows-are-printed-how-do-i-print-more

Related

Comparing each element in two columns and set another column

I have a data frame (after fread from a file) with two columns (dep and label). I want to set another column (mark) with id value depending on the match. If the 'dep' entry matches 'lablel' entry, mark get the 'id' of the matched 'label'. For no match, mark get the value of its own 'id'. Currently, I have work around solution with loops but I know there should be a neat way to do it in R specifics.
trace <- data.table(id=seq(1:7),dep=c(-1,45,40,47,0,45,43),
label=c(99,40,43,45,47,42,48), mark=rep("",7))
id dep label mark
1: 1 -1 99 1
2: 2 45 40 2
3: 3 40 43 2
4: 4 47 45 4
5: 5 0 47 5
6: 6 45 42 4
7: 7 43 48 3
I know loops are slow in r and just to give example the following naive for/while works for small sizes but my data set is huge.
trace$mark <- trace$id
for (i in 1:length(trace$id)){
val <- trace$dep[i]
j <- 1
while(j<=i && val !=-1 && val!=0){ // don't compare if val is -1/0
if(val==trace$label[j]){
trace$mark[i] <- trace$id[j]
}
j <-j +1
}
}
I have also tried using the following approach but it works only if there is a single match.
match <- which(trace$dep %in% trace$label)
match_to <- which(trace$label %in% trace$dep)
trace$mark[match] <- trace$mark[match_to]
This solution might help:
trace[trace[,.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
# id dep label mark
# 1: 1 -1 99 1
# 2: 2 45 40 4
# 3: 3 -1 43 3
# 4: 4 47 45 5
# 5: 5 -1 47 5
# 6: 6 45 42 4
# 7: 7 43 48 3
Update:
To make sure you are not matching dep with 0 or -1 values you can just add another line.
trace[dep %in% c(0,-1), mark:= as.character(id)]
OR
Try this:
trace[trace[!dep %in% c(0,-1),.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
The solution that worked
trace[trace[,.(id,dep=label)],on=.(id<=id,dep),mark:=as.char‌​acter(i.id),allow.ca‌​rtesian=TRUE]

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lies between the start and stop values. If so I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match. For every value from possible that falls between I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match I'd like a 1, otherwise a 0.
Neither ID, not SNR are important. I'm not looking to match on ID. Instead I wand to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptulised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with; the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between

r - lapply divides a column by an integer value from different dataset, unexpected result

I have two data.frames, one with genotype counts and one with a number that I need to normalize my counts from the first dataset.
countsdata=data.frame(genotype1=rep(c(10,20,30,40),each=1),
genotype2=rep(c(100,200,300,400),each=1),
genotype3=rep(c(40,50,60,70),each=1),
genotype4=rep(c(40,50,60,70),each=1)
)
coldata = data.frame(Group =c('genotype1', 'genotype2', 'genotype3', 'genotype4'),
Treatment = rep(c("control","treated"),each = 2),
Norm=rep(c(1,2,5,5)))
I made sure my variables don't have factors
factorsCharacter <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
as.character))
coldata=factorsCharacter(coldata)
Then I see that lapply loops through my counts, one column at the time and through my coldata that contains the normalization value (Norm). All is looking good, until I combined the two action in the same step
> lapply(coldata['Group'],function(group_i){group_i})
$Group
[1] "genotype1" "genotype2" "genotype3" "genotype4"
> lapply(coldata['Group'],function(group_i){countsdata[,group_i]})
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 20 200 50 50
3 30 300 60 60
4 40 400 70 70
> lapply(coldata['Group'],function(group_i){as.integer(coldata[coldata$Group==group_i,'Norm'])})
$Group
[1] 1 2 5 5
> lapply(coldata['Group'],function(group_i){
+ countsdata[,group_i]/as.integer(coldata[coldata$Group==group_i,'Norm'])
+ })
$Group
genotype1 genotype2 genotype3 genotype4
1 10 100 40 40
2 10 100 25 25
3 6 60 12 12
4 8 80 14 14
Here the result is not what I was expecting (dividing each column by its normalization number). After further inspection I noticed it's normalizing by rows, in other words it's normalizing across different columns, which shouldn't be the case as I am looping through one column at the time. I am probably missing a basic concept but looking through other SO posts didn't find anything I could use. My goal is to fix the code to make the right calculation but I also would like to understand why this code above is not working. Thanks so much.
The problem is in using [ and not [[. So, instead of looping through each of the elements in 'Group' column, we have a list of length 1 with all the elements. So, either use coldata[, 'Group'] or coldata[['Group']] or coldata$Group for looping.
countsdataNew <- countsdata
countsdataNew[] <- lapply(coldata[['Group']],function(group_i)
countsdata[,group_i]/coldata$Norm[coldata$Group==group_i])
countsdataNew
# genotype1 genotype2 genotype3 genotype4
#1 10 50 8 8
#2 20 100 10 10
#3 30 150 12 12
#4 40 200 14 14
If the column name in 'countsdata' and 'Group' column from 'countsdata' are in the same order, we can do this easily with Map
Map(`/`, countsdata, coldata$Norm)
Or just replicate the 'Norm' and do a simple division
countsdata/coldata$Norm[col(countsdata)]
Or with sweep
sweep(countsdata, 2, coldata$Norm, "/")

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for a certain data that I have. I would like to print only certain panels that satisfy a condition that I put in for panel.qq(x,y,...).
Let me give you an example. The following is my code,
qq(y ~ x|cond,data=test.df,panel=function(x,y,subscripts,...){
if(length(unique(test.df[subscripts,2])) > 3 ){panel.qq(x,y,subscripts,...})})
Here y is the factor and x is the variable that will be plotted on X and y axis. Cond is the conditioning variable. What I would like is, only those panels be printed that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to not have a panel for 123 as the number of unique values for 123 is 3, while for others its 4. Thanks again.
Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek.
My sample dataframe above is test.df
test.df.list <- split(test.df,test.df$cond,drop=F)
final.test.df <- do.call("rbind",lapply(test.df.list,function(r){
if(length(unique(r$x)) > 3){r}})
So, here I am breaking the test.df as a list of data.frames by the conditioning variable. Next, in the lapply I am checking the number of unique values in each of subset dataframe. If this number is greater than 3 then the dataframe is given /taken back if not it is ignored. Next, a do.call to bind all the dfs back to one big df to run the quantile quantile plot on it.
In case anyone wants to know the qq function call after getting the specific data. then it is,
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need to replace ZYO with the number 64 in my main data frame (g3) and like wise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now on a small scale I can write a code to change it like I did with ATR
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time consuming and increased chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function which I feel like may work but I am not sure how to logically build this argument, it was posted on the questions board here
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x <0,0)})
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)})
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with their fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA

Resources