Is there a way to equalise two different datasets in R?

I have the first dataset called exprs:
> class(exprs)
[1] "matrix"
> dim(exprs)
[1] 191812 89
My second dataset is called pData:
> class(pData)
[1] "data.frame"
> dim(pData)
[1] 89 3
However when I run:
all(rownames(pData)==colnames(exprs))
[1] FALSE
It results in FALSE. I need the final output to be TRUE.
Is this because one object is a data.frame while the other is a matrix?
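The class difference on its own should not cause this: rownames() and colnames() both return plain character vectors, so the comparison fails only when the names differ or are in a different order. A minimal sketch of how one might check and realign them follows. The sample and gene names are invented for illustration, and it assumes both objects contain the same set of names.

```r
# Toy stand-ins for the real objects (hypothetical sample and gene names).
exprs <- matrix(1:6, nrow = 2,
                dimnames = list(c("g1", "g2"), c("s2", "s1", "s3")))
pData <- data.frame(group = c("a", "b", "c"),
                    row.names = c("s1", "s2", "s3"))

# Names present on one side only; both should be empty before reordering.
setdiff(rownames(pData), colnames(exprs))
setdiff(colnames(exprs), rownames(pData))

# Reorder the rows of pData so they follow the columns of exprs.
pData <- pData[match(colnames(exprs), rownames(pData)), , drop = FALSE]

all(rownames(pData) == colnames(exprs))
```

If either setdiff() call returns names, the two objects do not describe the same samples and reordering alone will not make the check pass.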

Related

How to extract values from a list with multiple levels in R

I have a list that looks like this:
[[1]]
[[1]][[1]]
[[1]][[1]]$p1est.z
[1] 2.890829
[[1]][[1]]$p1se.z
[1] 0.1418367
[[1]][[2]]
[[1]][[2]]$p2est.w
[1] 4.947014
[[1]][[2]]$p2se.w
[1] 0.5986682
[[2]]
[[2]][[1]]
[[2]][[1]]$p1est.z
[1] 3.158164
[[2]][[1]]$p1se.z
[1] 0.138770
[[2]][[2]]
[[2]][[2]]$p2est.w
[1] 5.052874
[[2]][[2]]$p2se.w
[1] 0.585608
How can I extract the values of "p1est.z" from both top-level elements? I need to compute their average.
Thanks!
Actually the unlist() function out of the box should probably work here:
output <- unlist(your_list)
output[names(output) == "p1est.z"]
p1est.z p1est.z
2.890829 3.158164
Data:
your_list <- list(
  list(list(p1est.z=2.890829, p1se.z=0.1418367),
       list(p2est.w=4.947014, p2se.w=0.5986682)),
  list(list(p1est.z=3.158164, p1se.z=0.138770),
       list(p2est.w=5.052874, p2se.w=0.585608)))
Another way, using Tim Biegeleisen's representation of your data, is to write a function that extracts p1est.z and apply it. Your top-level list has two elements, and in both of them the first element contains p1est.z, so you could define
fn <- function(x) { x[[1]]$p1est.z }
and then apply it
sapply(your_list, fn)
# [1] 2.890829 3.158164
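Since the question asks for the average, the extracted values can be passed straight to mean(). The sketch below rebuilds the list as printed in the question and uses vapply() rather than sapply(); that is a stylistic choice here, not something the answers above require.

```r
your_list <- list(
  list(list(p1est.z = 2.890829, p1se.z = 0.1418367),
       list(p2est.w = 4.947014, p2se.w = 0.5986682)),
  list(list(p1est.z = 3.158164, p1se.z = 0.138770),
       list(p2est.w = 5.052874, p2se.w = 0.585608)))

# vapply() works like sapply() but insists that each result is one number.
p1 <- vapply(your_list, function(x) x[[1]]$p1est.z, numeric(1))

# Average of the two estimates.
mean(p1)
```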

Split c() inside of a string vector

I am working with a vector of strings in R. However, when I look at the first item in the list I see this:
> uni_list[1]
[1] c("ENSMUSG00000000204", "ENSMUSG00000115878", "ENSMUSG00000116453", "ENSMUSG00000116134")
15940 Levels: c("ENSMUSG00000000204", "ENSMUSG00000115878", "ENSMUSG00000116453", "ENSMUSG00000116134")
How can I split this one in separate values?
Thanks in advance,
Juan
You can use split, i.e.
split(l3[[1]], seq(length(l3[[1]])))
$`1`
[1] "ENSMUSG00000000204"
$`2`
[1] "ENSMUSG00000115878"
$`3`
[1] "ENSMUSG00000116453"
$`4`
[1] "ENSMUSG00000116134"
where
l3
[[1]]
[1] "ENSMUSG00000000204" "ENSMUSG00000115878" "ENSMUSG00000116453" "ENSMUSG00000116134"
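The printed "Levels:" line in the question suggests that uni_list may be a factor whose values are the text of a c(...) call rather than a list of character vectors. If that is the case, split() will not separate the IDs, and one option is to pull the quoted pieces out of the text. This is a sketch under that assumption, with a shortened, made-up stand-in for uni_list.

```r
# Hypothetical stand-in: a factor whose single value is the text of a c(...) call.
uni_list <- factor('c("ENSMUSG00000000204", "ENSMUSG00000115878")')

# Work on the character value, not on the factor's integer codes.
txt <- as.character(uni_list)

# Find every run of characters between double quotes, then drop the quotes.
ids <- regmatches(txt, gregexpr('"[^"]*"', txt))
ids <- lapply(ids, function(x) gsub('"', "", x, fixed = TRUE))

# One character vector of IDs per original element.
ids[[1]]
```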

Conversion of Elastic list data output to R data frame slow

I have an output from Elastic that takes very long to convert to an R data frame. I have tried multiple options; and feel there may be some trick there to quicken the process.
The structure of the list is as follows. The list holds data aggregated over 29 days (say). If the Elastic query output is in the list 'v_day', then v_day[[5]]$articles_over_time$buckets[1:29] represents each of the 29 days:
length(v_day[[5]]$articles_over_time$buckets)
[1] 29
page(v_day[[5]]$articles_over_time$buckets[[1]],method="print")
$key
[1] 1446336000000
$doc_count
[1] 35332
$group_by_state
$group_by_state$doc_count_error_upper_bound
[1] 0
$group_by_state$sum_other_doc_count
[1] 0
$group_by_state$buckets
$group_by_state$buckets[[1]]
$group_by_state$buckets[[1]]$key
[1] "detail"
$group_by_state$buckets[[1]]$doc_count
[1] 876
There is a "key" value right at the top (1446336000000) that I am interested in (let's call it the "time bucket key").
Within each day (take day i), "v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets" has more data I am interested in. This is an aggregation over each property (a property is an entity in the scheme of things here).
page(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets,method="print")
[[1]]
[[1]]$key
[1] "detail"
[[1]]$doc_count
[1] 876
[[2]]
[[2]]$key
[1] "ff8081814fdf2a9f014fdf80b05302e0"
[[2]]$doc_count
[1] 157
[[3]]
[[3]]$key
[1] "ff80818150a7d5930150a82abbc50477"
[[3]]$doc_count
[1] 63
[[4]]
[[4]]$key
[1] "ff8081814ff5f428014ffb5de99f1da5"
[[4]]$doc_count
[1] 57
[[5]]
[[5]]$key
[1] "ff8081815038099101503823fe5d00d9"
[[5]]$doc_count
[1] 56
This shows data for 5 properties on day i. Each property has a "key" (let's call it the "property bucket key") and a "doc_count" that I am interested in.
Eventually I want a data frame with "time bucket key", "property bucket key", "doc count".
Currently I loop over the days using the code below:
v <- NULL
ndays <- length(v_day[[5]]$articles_over_time$buckets)
for (i in 1:ndays) {
  v1 <- do.call("rbind", lapply(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets, data.frame))
  th_dt <- as.POSIXct(v_day[[5]]$articles_over_time$buckets[[i]]$key / 1000, origin="1970-01-01")
  v1$view_date <- th_dt
  v <- rbind(v, v1)
  msg <- sprintf("Read views for %s. Found %d \n", th_dt, sum(v1$doc_count))
  cat(msg)
}
v
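Two things in the loop above are likely to be slow: calling data.frame() once per property bucket, and growing v with rbind() on every iteration. A possible alternative is to build one data frame per day from whole vectors and bind them once at the end. The sketch below uses a small invented `buckets` list in place of v_day[[5]]$articles_over_time$buckets, and it fixes the time zone to UTC, which is an assumption the original code does not make.

```r
# Hypothetical miniature of v_day[[5]]$articles_over_time$buckets.
buckets <- list(
  list(key = 1446336000000, doc_count = 3,
       group_by_state = list(buckets = list(
         list(key = "detail", doc_count = 2),
         list(key = "abc", doc_count = 1)))),
  list(key = 1446422400000, doc_count = 5,
       group_by_state = list(buckets = list(
         list(key = "detail", doc_count = 5)))))

# One data frame per day, built from whole vectors rather than row by row.
per_day <- lapply(buckets, function(b) {
  inner <- b$group_by_state$buckets
  data.frame(
    view_date = rep(as.POSIXct(b$key / 1000, origin = "1970-01-01", tz = "UTC"),
                    length(inner)),
    key = vapply(inner, function(p) p$key, character(1)),
    doc_count = vapply(inner, function(p) as.numeric(p$doc_count), numeric(1)),
    stringsAsFactors = FALSE)
})

# Bind once at the end instead of growing the result inside a loop.
v <- do.call(rbind, per_day)
```

With many days, packages such as data.table (rbindlist) or dplyr (bind_rows) may bind the per-day pieces faster still, but the base R version above already avoids the repeated copying.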

Rearranging list into data.frame

I scraped 99 user profiles from forums for my PhD research.
The output is a list with 99 elements. Since each user decides which information to put on their profile, a different number of information snippets is attached to each element.
Here's a sample of the output (I also don't know why the numbering has all these $ and ` signs):
$`77.1`
$`77.1`[[1]]
[1] "Username:"
$`77.1`[[2]]
[1] "*Username*"
$`77.1`[[3]]
[1] "*Username*"
$`77.1`[[4]]
[1] "Rank:"
$`77.1`[[5]]
[1] "*Rank*"
$`77.1`[[6]]
[1] "Groups:"
$`77.1`[[7]]
[1] "*Groups*"
$`77.1`[[8]]
[1] "Location:"
$`77.1`[[9]]
[1] "*Location*"
$`77.1`[[10]]
[1] ""
$`78.1`
$`78.1`[[1]]
[1] "Username:"
$`78.1`[[2]]
[1] "*Username*"
$`78.1`[[3]]
[1] "*Username*"
$`78.1`[[4]]
[1] "Rank:"
$`78.1`[[5]]
[1] "*Rank*"
$`78.1`[[6]]
[1] "Age:"
$`78.1`[[7]]
[1] "*AGE*"
$`78.1`[[8]]
[1] "Groups:"
$`78.1`[[9]]
[1] "*Groups*"
$`78.1`[[10]]
[1] "Interests in history:"
$`78.1`[[11]]
[1] "*Interests*"
$`78.1`[[12]]
[1] "Location:"
$`78.1`[[13]]
[1] "*Location*"
$`78.1`[[14]]
[1] ""
Is there a way to arrange this list into a data frame where each row consists of information from one element?
I tried to arrange them into a matrix, but this doesn't work well because a matrix needs a consistent number of columns, which I don't have.
I would love it to look like this:
Id 1 2 3 4 5 6
1 Username: *Username* Rank *Rank* Groups: *Groups*
2 Username: *Username2* ...
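On the side question: the $ and backticks are just how R prints named list elements whose names (such as 77.1) are not syntactically valid variable names. For the layout asked for, one possible approach is to pad every element with NA up to the length of the longest one and then stack them as rows. This is a sketch on an invented two-profile list called `profiles`; the real object name is not given in the question.

```r
# Hypothetical miniature of the scraped output: elements of different lengths.
profiles <- list(
  `77.1` = list("Username:", "*Username*", "Rank:", "*Rank*"),
  `78.1` = list("Username:", "*Username*", "Rank:", "*Rank*", "Age:", "*AGE*"))

# Flatten each element to a character vector and pad the short ones with NA.
width <- max(lengths(profiles))
rows <- lapply(profiles, function(p) {
  x <- unlist(p)
  length(x) <- width   # extending a vector fills the new slots with NA
  x
})

# Stack the padded vectors; each list element becomes one row.
profile_df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Note that the same column can then hold different fields for different users (for example "Age:" for one and "Groups:" for another), so a label/value reshaping may be a better fit if the columns need a fixed meaning.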

Two (supposedly) identical date objects in R are not equal?

I have a simple question. I have two Date objects in R that are supposed to be identical (they have the same value and class), but R says they are not equal. I am running on Linux, though I get the same result on a Windows machine. Why is this happening?
code:
start=as.Date("2014-12-31")
finish=as.Date("2014-11-28")
dates = seq(start,finish,length=6)
christmasEve = as.Date("2014-12-24")
print(dates[2])
print(christmasEve)
print(class(dates[2]))
print(class(christmasEve))
(christmasEve==dates[2])
output:
[1] "2014-12-24"
[1] "2014-12-24"
[1] "Date"
[1] "Date"
[1] FALSE
Any help would be greatly appreciated!
-Paul
The problem is that the 33 days between start and finish do not divide evenly into the five equal steps that six points require, so each step is 6.6 days. Check out:
as.numeric(dates)
# [1] 16435.0 16428.4 16421.8 16415.2 16408.6 16402.0
start - finish
# Time difference of 33 days
Since you are creating the dates as an equally spaced sequence, the underlying values are not whole numbers of days:
> as.numeric(dates)
[1] 16435.0 16428.4 16421.8 16415.2 16408.6 16402.0
> as.numeric(christmasEve)
[1] 16428
> as.character(christmasEve) == as.character(dates[2])
[1] TRUE
It is not possible to test your code as there is no sampleRate. I assumed that sampleRate is 6. You could compare your dates with the code below:
all(as.character(christmasEve) == as.character(dates[2]))
The whole thing should work like this:
> sampleRate <- 6
>
> start=as.Date("2014-12-31")
> finish=as.Date("2014-11-28")
> dates = seq(start,finish,length=sampleRate)
> christmasEve = as.Date("2014-12-24")
> print(dates[2])
[1] "2014-12-24"
> print(christmasEve)
[1] "2014-12-24"
> print(class(dates[2]))
[1] "Date"
> print(class(christmasEve))
[1] "Date"
> (christmasEve==dates[2])
[1] FALSE
>
> all(christmasEve == dates[2])
[1] FALSE
> all(as.character(christmasEve) == as.character(dates[2])
+ )
[1] TRUE
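Comparing character strings works, but the dates can also be kept as Date objects by dropping the fractional part once, right after the sequence is built. The following is a minimal sketch of that idea using the values from the question; `whole_dates` is a name introduced here for illustration.

```r
start  <- as.Date("2014-12-31")
finish <- as.Date("2014-11-28")
christmasEve <- as.Date("2014-12-24")

# Six equally spaced points over 33 days land on fractional day counts.
dates <- seq(start, finish, length.out = 6)

# Round-trip through the printed form to get whole-day Date values.
whole_dates <- as.Date(format(dates))

# The comparison is now between two whole-day values.
whole_dates[2] == christmasEve
```

If whole days are wanted from the start, seq() with a `by` argument (for example by = "-1 week") avoids fractional values altogether, at the cost of not hitting `finish` exactly.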
