Why does subset cut the decimal part? - r

Hi, this is a sample of a data.frame / list with two columns, X and Y. My problem is that when I call subset, the decimal part is cut off. Can you help me figure out why?
(row.names | X | Y)
> var
...
9150 4246838.57 5785639.07
9152 4462019.15 5756344.11
9153 4671745.07 5791092.53
9154 4825699.93 5767058.37
9155 4935126.99 5839357.55
> typeof(var)
[1] "list"
> var = subset(var, Y>10980116 & X>3217133)
...
6569 15163607 11323070
6572 15102381 11079465
6573 16462260 11272569
6577 19028175 11095784
It's the same when I use:
> var = var[var$Y>10980116 & var$X>3217133,]
Thank you for your help.

This is not a subsetting issue, it's a formatting/presentation issue. You're in the first circle of Burns's R Inferno ("[i]f you are using R and you think you’re in hell, this is a map for you"):
another aspect of virtuous pagan beliefs—what is printed is all
that there is
If we just print this bit of the data frame exactly as entered, we "lose" digits.
> df <- read.table(text="
4246838.57 5785639.07
4462019.15 5756344.11
4671745.07 5791092.53
4825699.93 5767058.37
4935126.99 5839357.55",
header=FALSE)
> df
## V1 V2
## 1 4246839 5785639
## 2 4462019 5756344
## 3 4671745 5791093
## 4 4825700 5767058
## 5 4935127 5839358
Tell R you want to see more precision:
> print(df,digits=10)
## V1 V2
## 1 4246838.57 5785639.07
## 2 4462019.15 5756344.11
## 3 4671745.07 5791092.53
## 4 4825699.93 5767058.37
## 5 4935126.99 5839357.55
Or you can set options(digits=10) (the default is 7).
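A quick way to confirm that only the printed representation changes is to format the stored values after subsetting. The sketch below reuses two rows of the sample data; the names `df` and `kept` and the threshold are illustrative and not taken from the question.

```r
# Two rows of the sample data; the values are stored as full doubles.
df <- read.table(text = "
4246838.57 5785639.07
4462019.15 5756344.11",
  header = FALSE)

# Subsetting leaves the stored values untouched (threshold chosen for illustration).
kept <- subset(df, V2 > 5780000)

# Default printing rounds to 7 significant digits...
print(kept)

# ...but formatting with two decimals shows the digits are still there.
sprintf("%.2f", kept$V1)
sprintf("%.2f", kept$V2)
```

Unlike options(digits=10), sprintf() changes nothing globally, so it is handy for a one-off check.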


lcmm::predictClass with l-spline link function

I am getting an error message trying to predict class membership in lcmm::predictClass(). This seems to be due to using a spline-based link function, as exemplified below. The lcmm::predictClass() function works okay for the default link function.
The following shows 1) a reproducible example giving the error message, and 2) a working example with the same broad approach.
## define initialisation values for quick result here
BB <- c(-19.064,21.718,-1.192,-1.295,-1.205,-0.281,0.110,
-0.232, 1.339,-1.007, 1.019,-9.395, 1.702,2.030,
2.089, 1.352,-9.369, 1.220, 1.532, 2.481,1.223)
library(lcmm)
m2c <- multlcmm(Ydep1+Ydep2~1+Time*X2,
random=~1+Time,
subject="ID",
link="3-quant-splines",
ng=2,
mixture=~1+Time,
classmb=~1+X1,
data=data_lcmm,
B=BB)
## converges in 3 iterations
## define the prediction cases
library(dplyr)
X <- data_lcmm %>%
filter(ID %in% sample(ID,10)) %>% ## 10 random IDs
select(ID,Ydep1,Ydep2,Time,X1,X2)
## find predicted class memberships
predictClass(m2c, newdata=X)
## Error in multlcmm(fixed = Ydep1 + Ydep2 ~ 1 + Time * X2, mixture = ~1 + :
## Length of vector range is not correct.
On the other hand, a similar approach with a linear link function gives the following. Note that these models are based on the example in the ?multlcmm help section.
library(lcmm)
m2 <- multlcmm(Ydep1+Ydep2~1+Time*X2,
random=~1+Time,
subject="ID",
link="linear",
ng=2,
mixture=~1+Time,
classmb=~1+X1,
data=data_lcmm,
B=c(18,-20.77,1.16,-1.41,-1.39,-0.32,0.16,
-0.26,1.69,1.12,1.1,10.8,1.24,24.88,1.89))
## converges in 2 iterations
library(dplyr)
X <- data_lcmm %>%
filter(ID %in% sample(ID,10)) %>%
select(ID,Ydep1,Ydep2,Time,X1,X2)
predictClass(m2, newdata=X)
## ID class prob1 prob2
## 1 21 2 0.031948951 9.680510e-01
## 2 25 2 0.042938984 9.570610e-01
## 3 33 2 0.026053178 9.739468e-01
## 4 46 1 0.999999964 3.597409e-08
## 5 50 2 0.066291287 9.337087e-01
## 6 74 2 0.005630593 9.943694e-01
## 7 120 2 0.024787290 9.752127e-01
## 8 171 2 0.053499974 9.465000e-01
## 9 229 1 0.999999996 4.368222e-09
##10 235 2 0.008173507 9.918265e-01
## ...or similar
The other predict functions predictL() and predictY() seem to work okay. The predictRE() throws the same error message.
I will also email the package maintainer.

How can I implement a function?

I have two data files as below:
head (RNA)
Gene_ID chr start end
1 ENSG00000000003.1 X 99883667 99884983
2 ENSG00000000003.2 X 99885756 99885863
3 ENSG00000000003.3 X 99887482 99887565
4 ENSG00000000003.4 X 99888402 99888536
5 ENSG00000000003.5 X 99888928 99889026
6 ENSG00000000003.6 X 99890175 99890249
head(snp)
chr start end SNP_No
1 1 58812 58812 SNP_1
2 1 67230 67230 SNP_2
3 1 79529 79529 SNP_3
4 1 79595 79595 SNP_4
5 1 85665 85665 SNP_5
6 1 86064 86064 SNP_6
I would like to find overlaps between the snp file and the RNA file, so I used the GenomicRanges R package and ran the commands below:
gr_RNA <- GRanges(seqnames=RNA$chr,IRanges(start=RNA$start,end=RNA$end,names=RNA$Gene_ID))
gr_SNP <- GRanges(seqnames=snp$chr, IRanges(start=snp$start,end=snp$end,names=snp$SNP_No))
overlaps <- findOverlaps(gr_RNA, gr_SNP)
subsetByOver <- subsetByOverlaps(gr_RNA, gr_SNP)
match_hit <- data.frame(names(gr_RNA)[queryHits(overlaps)],names(gr_SNP)[subjectHits(overlaps)],stringsAsFactors=F)
names(match_hit) <- c('Gene_ID','SNP')
head(match_hit)
Gene_ID SNP
1 ENSG00000000457.1 SNP_307301
2 ENSG00000000457.2 SNP_307307
3 ENSG00000000457.11 SNP_307365
4 ENSG00000000457.12 SNP_307387
5 ENSG00000000460.1 SNP_306845
6 ENSG00000000460.1 SNP_306846
dim(match_hit)
[1] 12287 2
Then I expanded the start and end positions from the RNA file ("start-100" and "end+100") and ran the script again as below:
gr_RNA1 <- GRanges(seqnames=RNA$chr, IRanges(start=(RNA$start)-100, end=(RNA$end)+100, names=RNA$Gene_ID))
overlaps <- findOverlaps(gr_RNA1, gr_SNP)
subsetByOver<-subsetByOverlaps(gr_RNA1, gr_SNP)
match_hit1 <- data.frame(names(gr_RNA1)[queryHits(overlaps)],names(gr_SNP)[subjectHits(overlaps)],stringsAsFactors=F)
dim(match_hit1)
[1] 17976 2
Now, I want to implement a function which takes the RNA table, the SNP table, and the expand distance, and returns the final results.
Functions in R are defined like this:
myFunction <- function(parameters) {
#function Code
return(result)
}
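Applied to the question, the steps already shown can be wrapped as follows. This is a minimal sketch, assuming the GenomicRanges package (with IRanges and S4Vectors) is installed and that both tables have the columns shown above; `find_snp_overlaps` and its `expand` argument are hypothetical names, not part of any package.

```r
# Sketch: wrap the overlap steps from the question in one function.
# `find_snp_overlaps` and `expand` are illustrative names (assumptions).
find_snp_overlaps <- function(rna, snp, expand = 0) {
  # RNA ranges, widened by `expand` bases on both sides.
  gr_rna <- GenomicRanges::GRanges(
    seqnames = as.character(rna$chr),
    ranges = IRanges::IRanges(start = rna$start - expand,
                              end = rna$end + expand,
                              names = as.character(rna$Gene_ID)))
  # SNP ranges, used as given.
  gr_snp <- GenomicRanges::GRanges(
    seqnames = as.character(snp$chr),
    ranges = IRanges::IRanges(start = snp$start,
                              end = snp$end,
                              names = as.character(snp$SNP_No)))
  # One row per (RNA range, SNP) pair that overlaps.
  hits <- GenomicRanges::findOverlaps(gr_rna, gr_snp)
  data.frame(Gene_ID = names(gr_rna)[S4Vectors::queryHits(hits)],
             SNP = names(gr_snp)[S4Vectors::subjectHits(hits)],
             stringsAsFactors = FALSE)
}
```

With the objects from the question, find_snp_overlaps(RNA, snp) would correspond to the first run and find_snp_overlaps(RNA, snp, expand = 100) to the second.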

R : Create specific bin based on data range

I am attempting to repeatedly add a "fixed number" to a numeric vector depending on a specified number of bins. However, the "fixed number" depends on the data range.
For instance, I have a data range of 10 to 1010, and I wish to separate the data into 100 bins.
Since 1010 - 10 = 1000
and 1000 / 100 (the number of bins specified) = 10,
the ideal data would look like this:
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex: there is not just one data range but several. Hopefully the example below clarifies this:
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note theoretically it would have decimal places,
#but i do not want any decimal place
4858096 4858498
.. ..
So far I was thinking along the lines of this kind of function, but it seems inefficient because:
1) I have to retype the function body 100 times (because my number of bins is 100)
2) I can't find a way to repeat the function along my values. In other words, my function can only deal with the data 10-1010 and not the next range, 5000-6500
# The range of the variable
width <- end - start
# The bin size (Number of required bin)
bin_size <- 100
bin_count <- width/bin_size
# Create a function
f1 <- function(x,y){
c(x[1],
x[1] + y[1],
x[1] + y[1]*2,
x[1] + y[1]*3)
}
f1(x= start,y=bin_count)
f1
[1] 10 20 30 40
Any hints or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought I would share it. I used the package "binr" and its function "bins" to get the required bins. Please find my attempt below; it's slightly different from the intended output, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extract the output from "bins" function into a list
for (i in seq_along(start)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
# Now i need to convert one of the output from bins into numeric value
s <- gsub(",.*", "", names(tmp$binct))
s <- gsub("\\[","",s)
tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with slight modification to get the end value of the bin
tmp_list_end <- list()
for (i in seq_along(end)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
e <- gsub(".*,", "", names(tmp$binct))
e <- gsub("]","",e)
tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my rough code. Please share if there is a better way of doing this. It would be nice if someone could comment on how to wrap this into a function.
Here's a way that may help with base R:
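# BINS defaults to 100 so that the two-argument calls below run as shown
bin_it <- function(START, END, BINS = 100) {
  range <- END-START
  jump <- range/BINS   # bin width; may be fractional
  v1 <- c(START, seq(START+jump+1, END, jump))   # bin starts
  v2 <- seq(START+jump-1, END, jump)+1           # bin ends
  data.frame(v1, v2)
}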
It uses the function seq to create the vectors of numbers leading to the ending number. It may not work for every case, but for the ranges you gave it should give the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107
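For the contiguous, whole-number bins sketched in the question (10 20, 20 30, ...), one alternative is to compute the break points with integer arithmetic and vectorise over all ranges with Map(). This is a sketch, assuming integer-valued start and end values; `make_bins` and `n_bins` are illustrative names.

```r
# Sketch: contiguous integer bins for several ranges at once.
# `make_bins` and `n_bins` are illustrative names (assumptions).
make_bins <- function(start, end, n_bins = 100) {
  one_range <- function(s, e) {
    # n_bins + 1 break points; floor() drops the decimal part of each offset.
    breaks <- s + floor((0:n_bins) * (e - s) / n_bins)
    data.frame(start = breaks[-length(breaks)], end = breaks[-1])
  }
  # Apply to every (start, end) pair and stack the results.
  do.call(rbind, Map(one_range, start, end))
}

start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
bins <- make_bins(start, end)
head(bins, 3)
```

Each range contributes `n_bins` rows, and the end of one bin equals the start of the next, so no value between start and end falls outside a bin.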

extracting hashtags from tweets

I am trying to perform sentiment analysis and am facing a small problem. I am using a dictionary which has hashtags and some other junk values (shown below). It also has the associated weight of each hashtag. I want to extract only the hashtags and their corresponding weights into a new data frame. Is there an easy way to do it?
I have tried using regmatches, but somehow it gives output in list format, which messes things up.
Input:
V1 V2
1 #fabulous 7.526
2 #excellent 7.247
3 superb 7.199
4 #perfection 7.099
5 #terrific 6.922
6 #magnificent 6.672
Output:
V1 V2
1 #fabulous 7.526
2 #excellent 7.247
3 #perfection 7.099
4 #terrific 6.922
5 #magnificent 6.672
To select only the entries that are hashtags you can use the simple regex ^# (meaning "anything that starts with a #"):
> input[grepl("^#",input[,1]),]
V1 V2
1 #fabulous 7.526
2 #excellent 7.247
4 #perfection 7.099
5 #terrific 6.922
6 #magnificent 6.672
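Note that the filtered rows keep their original row names (1, 2, 4, 5, 6), whereas the desired output is numbered 1 to 5. A small sketch of the same filter with the row names reset; `input` is rebuilt here from the sample in the question, and `hashtags` is an illustrative name.

```r
# Rebuild the sample dictionary from the question.
input <- data.frame(
  V1 = c("#fabulous", "#excellent", "superb",
         "#perfection", "#terrific", "#magnificent"),
  V2 = c(7.526, 7.247, 7.199, 7.099, 6.922, 6.672),
  stringsAsFactors = FALSE)

# Keep only rows whose first column starts with "#".
hashtags <- input[grepl("^#", input$V1), ]

# Reset the row names so they run consecutively from 1.
rownames(hashtags) <- NULL
hashtags
```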
Otherwise, starting from your original data, the regex #[[:alnum:]]+ (meaning: "a hashtag, followed by 1 or more alphanumeric characters") should help you grab the hashtags:
> tweets <- c("New R job: Statistical and Methodological Consultant at the Center for Open Science http://www.r-users.com/jobs/statistical-methodological-consultant-center-open-science/ … #rstats #jobs","New R job: Research Engineer/Applied Researcher at eBay http://www.r-users.com/jobs/research-engineerapplied-researcher-ebay/ … #rstats #jobs")
> match <- regmatches(tweets,gregexpr("#[[:alnum:]]+",tweets))
> match
[[1]]
[1] "#rstats" "#jobs"
[[2]]
[1] "#rstats" "#jobs"
> unlist(match)
[1] "#rstats" "#jobs" "#rstats" "#jobs"
This code should work and will give you the desired output as a data.frame:
Input<- data.frame(V1 = c("#fabulous","#excellent","superb","#perfection","#terrific","#magnificent"), V2 = c("7.526", "7.247" , "7.199", "7.099", "6.922", "6.672"))
extractHashtags <- Input[which(substr(Input$V1,1,1) == "#"),]
View(extractHashtags)

Adding variables to a data.frame using a string as syntax

Suppose I have these variables:
data <- data.frame(x=rnorm(10), y=rnorm(10))
form <- 'z = x*y'
How can I compute z (using data's variables) and add it as a new variable to data?
I tried with parse() and eval() (based on an old question), but without success :/
Given that what @Nico said is correct, you might do:
d1 <- within(data, eval(parse(text=form)) )
d1
x y z
1 0.5939462 1.58683345 0.94249368
2 0.3329504 0.55848643 0.18594826
3 1.0630998 -1.27659221 -1.35714497
4 -0.3041839 -0.57326541 0.17437812
5 0.3700188 -1.22461261 -0.45312970
6 0.2670988 -0.47340064 -0.12644474
7 -0.5425200 -0.62036668 0.33656135
8 1.2078678 0.04211587 0.05087041
9 1.1604026 -0.91092165 -1.05703586
10 0.7002136 0.15802877 0.11065390
transform() is the easy way if using this interactively:
data <- data.frame(x=rnorm(10), y=rnorm(10))
data <- transform(data, z = x * y)
R> head(data)
x y z
1 -1.0206 0.29982 -0.30599
2 -1.6985 1.51784 -2.57805
3 0.8940 1.19893 1.07187
4 -0.3672 -0.04008 0.01472
5 0.5266 -0.29205 -0.15381
6 0.2545 -0.26889 -0.06842
You can't do this using form though, but within(), which is similar to transform(), does allow this, e.g.
R> within(data, eval(parse(text = form)))
x y z
1 -0.8833 -0.05256 0.046428
2 1.6673 1.61101 2.686115
3 1.1261 0.16025 0.180453
4 0.9726 -1.32975 -1.293266
5 -1.6220 -0.51079 0.828473
6 -1.1981 2.62663 -3.147073
7 -0.3596 -0.01506 0.005416
8 -0.9700 0.21865 -0.212079
9 1.0626 1.30377 1.385399
10 -0.8020 -1.04639 0.839212
though it involves some amount of jiggery-pokery with the language which to my mind is not elegant. Effectively, you are doing something like this:
R> eval(eval(parse(text = form), data), data, parent.frame())
[1] 0.046428 2.686115 0.180453 -1.293266 0.828473 -3.147073 0.005416
[8] -0.212079 1.385399 0.839212
(and assigning the result to the named component in data.)
Does form have to come like this, as a character string representing some expression to be evaluated?
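If the string really must be the input, one option is to parse it once, check that it is a single assignment, and evaluate only the right-hand side inside the data frame. This is a sketch under that assumption; `add_var_from_string` is a hypothetical helper, and because it evaluates arbitrary code it should only be given trusted strings.

```r
# Sketch: add a column described by a string such as 'z = x*y'.
# `add_var_from_string` is a hypothetical helper name (an assumption).
add_var_from_string <- function(df, form) {
  e <- parse(text = form)[[1]]
  # Expect a single assignment of the form `name = expression`.
  stopifnot(is.call(e), identical(e[[1]], as.name("=")), is.name(e[[2]]))
  # Evaluate the right-hand side using the columns of df,
  # then store the result under the left-hand-side name.
  df[[as.character(e[[2]])]] <- eval(e[[3]], df, parent.frame())
  df
}

data <- data.frame(x = rnorm(10), y = rnorm(10))
form <- 'z = x*y'
data <- add_var_from_string(data, form)
head(data)
```

Compared with within(), this makes the target column name explicit and fails early if the string is not a plain assignment.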
