I scraped 99 user profiles from forums for my PhD research.
The output is a list with 99 elements. Since each user decides which information to put on their profile, each element has a different number of information snippets attached to it.
Here's a sample of the output (I also don't know why the numbering has all these $ and ` signs):
$`77.1`
$`77.1`[[1]]
[1] "Username:"
$`77.1`[[2]]
[1] "*Username*"
$`77.1`[[3]]
[1] "*Username*"
$`77.1`[[4]]
[1] "Rank:"
$`77.1`[[5]]
[1] "*Rank*"
$`77.1`[[6]]
[1] "Groups:"
$`77.1`[[7]]
[1] "*Groups*"
$`77.1`[[8]]
[1] "Location:"
$`77.1`[[9]]
[1] "*Location*"
$`77.1`[[10]]
[1] ""
$`78.1`
$`78.1`[[1]]
[1] "Username:"
$`78.1`[[2]]
[1] "*Username*"
$`78.1`[[3]]
[1] "*Username*"
$`78.1`[[4]]
[1] "Rank:"
$`78.1`[[5]]
[1] "*Rank*"
$`78.1`[[6]]
[1] "Age:"
$`78.1`[[7]]
[1] "*AGE*"
$`78.1`[[8]]
[1] "Groups:"
$`78.1`[[9]]
[1] "*Groups*"
$`78.1`[[10]]
[1] "Interests in history:"
$`78.1`[[11]]
[1] "*Interests*"
$`78.1`[[12]]
[1] "Location:"
$`78.1`[[13]]
[1] "*Location*"
$`78.1`[[14]]
[1] ""
Is there a way to arrange this list into a data frame where each row consists of information from one element?
I tried to arrange them into a matrix, but this doesn't work well because a matrix needs a consistent number of columns, which my data doesn't have.
I would love it to look like this:
Id 1 2 3 4 5 6
1 Username: *Username* Rank *Rank* Groups: *Groups*
2 Username: *Username2* ...
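One possible approach, as a minimal sketch: flatten each element to a character vector, pad the shorter ones with NA up to the longest length, and bind the rows. The `profiles` object below is a hypothetical stand-in for the scraped list, not the real data. (The `$` and backticks in the printout are just how R prints names of list elements that are not syntactic names, such as `77.1`.)

```r
# Hypothetical stand-in for the scraped list: two profiles of different lengths
profiles <- list(
  `77.1` = list("Username:", "*Username*", "Rank:", "*Rank*"),
  `78.1` = list("Username:", "*Username2*", "Rank:", "*Rank*", "Age:", "*AGE*")
)

# The longest profile determines the number of columns
n_cols <- max(lengths(profiles))

# Flatten each profile to a character vector and pad it with NA
padded <- lapply(profiles, function(p) {
  v <- unlist(p)
  length(v) <- n_cols  # extending a vector fills the new slots with NA
  v
})

# One row per profile; the row names are the original list names
profile_df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(profile_df)
```

Because profiles can carry different fields (for example "Age:" only in the second one), the same column may hold different fields in different rows. Matching values to their labels first would avoid that, but it depends on how regular the label/value pairs are.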
I have the first dataset called exprs:
> class(exprs)
[1] "matrix"
> dim(exprs)
[1] 191812 89
My second dataset is called pData:
> class(pData)
[1] "data.frame"
> dim(pData)
[1] 89 3
However when I run:
all(rownames(pData)==colnames(exprs))
[1] FALSE
It results in FALSE. I need the final output to be TRUE.
Is this because one object is a data.frame while the other is a matrix?
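The class difference is unlikely to be the cause: `rownames()` and `colnames()` both return character vectors, and `==` compares them element by element, so the result is FALSE whenever any name differs or the order differs. Here is a minimal sketch with small hypothetical stand-ins for `exprs` and `pData`, assuming both hold the same sample names in a different order:

```r
# Hypothetical small stand-ins for exprs (matrix) and pData (data.frame)
exprs <- matrix(1:6, nrow = 2,
                dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
pData <- data.frame(group = c("a", "b", "c"),
                    row.names = c("s3", "s1", "s2"))

# Same names, different order, so the elementwise comparison fails
all(rownames(pData) == colnames(exprs))

# Check whether any names exist on one side only
setdiff(rownames(pData), colnames(exprs))
setdiff(colnames(exprs), rownames(pData))

# Reorder the rows of pData to follow the columns of exprs
pData <- pData[match(colnames(exprs), rownames(pData)), , drop = FALSE]
all(rownames(pData) == colnames(exprs))
```

If either `setdiff()` call returns names, the two objects do not describe the same samples and reordering alone will not make the check TRUE.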
I am working with a vector of strings in R. However, when I print the first item, I see this:
> uni_list[1]
[1] c("ENSMUSG00000000204", "ENSMUSG00000115878", "ENSMUSG00000116453", "ENSMUSG00000116134")
15940 Levels: c("ENSMUSG00000000204", "ENSMUSG00000115878", "ENSMUSG00000116453", "ENSMUSG00000116134")
How can I split this into separate values?
Thanks in advance,
Juan
You can use split, i.e.
split(l3[[1]], seq(length(l3[[1]])))
$`1`
[1] "ENSMUSG00000000204"
$`2`
[1] "ENSMUSG00000115878"
$`3`
[1] "ENSMUSG00000116453"
$`4`
[1] "ENSMUSG00000116134"
where
l3
[[1]]
[1] "ENSMUSG00000000204" "ENSMUSG00000115878" "ENSMUSG00000116453" "ENSMUSG00000116134"
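Note that the "Levels:" line in the question suggests that `uni_list` is actually a factor whose levels are the literal text `c("...", "...")`, not a list of character vectors. If that is the case, a rough sketch would be to convert to character and pull the IDs out with a regular expression. The `uni_list` below is a hypothetical reconstruction, and the pattern assumes every ID looks like `ENSMUSG` followed by digits:

```r
# Hypothetical reconstruction: a factor whose level is the deparsed text
uni_list <- factor('c("ENSMUSG00000000204", "ENSMUSG00000115878")')

# Convert the first element to plain text
x <- as.character(uni_list[1])

# Extract every ID-like token from the text
ids <- regmatches(x, gregexpr("ENSMUSG[0-9]+", x))[[1]]
print(ids)
```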
I'm working on a Shiny app that plots data trees. I'd like to incorporate the shinyTree package to allow quick comparison of plotted nodes. The issue is that shinyTree returns a redundant list of lists for the selected sub-nodes.
The actual list of lists is included below. I would like to keep only the longest branches. I would also like to remove the id node (the integer node); I don't understand why it even shows up, given the list. I have tried many different methods to work with this list, but it's been a real struggle. The list concept is difficult to understand.
I create the data.tree and plot via:
dataTree.a <- FromListSimple(checkList)
plot(dataTree.a)
> checkList
[[1]]
[[1]]$Asia
[[1]]$Asia$China
[[1]]$Asia$China$Beijing
[[1]]$Asia$China$Beijing$Round
[[1]]$Asia$China$Beijing$Round$`20383994`
[1] 0
[[2]]
[[2]]$Asia
[[2]]$Asia$China
[[2]]$Asia$China$Beijing
[[2]]$Asia$China$Beijing$Round
[1] 0
[[3]]
[[3]]$Asia
[[3]]$Asia$China
[[3]]$Asia$China$Beijing
[1] 0
[[4]]
[[4]]$Asia
[[4]]$Asia$China
[[4]]$Asia$China$Shanghai
[[4]]$Asia$China$Shanghai$Round
[[4]]$Asia$China$Shanghai$Round$`23740778`
[1] 0
[[5]]
[[5]]$Asia
[[5]]$Asia$China
[[5]]$Asia$China$Shanghai
[[5]]$Asia$China$Shanghai$Round
[1] 0
[[6]]
[[6]]$Asia
[[6]]$Asia$China
[[6]]$Asia$China$Shanghai
[1] 0
[[7]]
[[7]]$Asia
[[7]]$Asia$China
[1] 0
[[8]]
[[8]]$Asia
[[8]]$Asia$India
[[8]]$Asia$India$Delhi
[[8]]$Asia$India$Delhi$Round
[[8]]$Asia$India$Delhi$Round$`25703168`
[1] 0
[[9]]
[[9]]$Asia
[[9]]$Asia$India
[[9]]$Asia$India$Delhi
[[9]]$Asia$India$Delhi$Round
[1] 0
[[10]]
[[10]]$Asia
[[10]]$Asia$India
[[10]]$Asia$India$Delhi
[1] 0
[[11]]
[[11]]$Asia
[[11]]$Asia$India
[1] 0
[[12]]
[[12]]$Asia
[[12]]$Asia$Japan
[[12]]$Asia$Japan$Tokyo
[[12]]$Asia$Japan$Tokyo$Round
[[12]]$Asia$Japan$Tokyo$Round$`38001000`
[1] 0
[[13]]
[[13]]$Asia
[[13]]$Asia$Japan
[[13]]$Asia$Japan$Tokyo
[[13]]$Asia$Japan$Tokyo$Round
[1] 0
[[14]]
[[14]]$Asia
[[14]]$Asia$Japan
[[14]]$Asia$Japan$Tokyo
[1] 0
[[15]]
[[15]]$Asia
[[15]]$Asia$Japan
[1] 0
[[16]]
[[16]]$Asia
[1] 0
Well, I did cobble together a poor hack to make this work. Here is what I did to the 'checkList' list:
checkList <- get_selected(tree, format = "slices")
# Convert and collapse shinyTree slices to data.tree.
# This is a bit of a kludge to make the graphic work with
# shinyTree; an alternate one-liner is in the works.
# The transform works by finding the longest branches
# and plotting only those, since the other branches are
# subsets of them due to the slices.
# Extract the checkList names (as characters) from the checkList
tmp <- names(unlist(checkList))
# Determine the depth of each name (number of '.'-separated parts)
lens <- lengths(strsplit(tmp, ".", fixed = TRUE))
# Find the indices of the elements with the greatest depth
lens.max <- which(lens == max(lens))
# Replace every '.' with '/' to prepare for the data.tree conversion
# (base gsub, so stringr is not needed)
tmp <- gsub(".", "/", tmp, fixed = TRUE)
# Add a root node to work with multiple branches
tmp <- paste0("Root/", tmp)
# Keep only the longest branches
longBranches <- tmp[lens.max]
# Convert to a data.frame with a pathString column
longBranches.df <- data.frame(pathString = longBranches)
# Publish the data.frame for use
vals$selDF <- longBranches.df
#save(checkList, file = "chkLists.RData") # Save for troubleshooting
print(vals$selDF)
The new checkList looks like this:
[1] "Root/Europe/France/Paris/Round/10843285" "Root/Europe/France/Paris/Round"
[3] "Root/Europe/France/Paris" "Root/Europe/France"
[5] "Root/Europe/Germany/Berlin/Diamond/3563194" "Root/Europe/Germany/Berlin/Diamond"
[7] "Root/Europe/Germany/Berlin/Round/3563194" "Root/Europe/Germany/Berlin/Round"
[9] "Root/Europe/Germany/Berlin" "Root/Europe/Germany"
[11] "Root/Europe/Italy/Rome/Round/3717956" "Root/Europe/Italy/Rome/Round"
[13] "Root/Europe/Italy/Rome" "Root/Europe/Italy"
[15] "Root/Europe/United Kingdom/London/Round/10313307" "Root/Europe/United Kingdom/London/Round"
[17] "Root/Europe/United Kingdom/London" "Root/Europe/United Kingdom"
[19] "Root/Europe"
It works :) ... but I think this could be done with a two-liner. I'll work on it again in a week or so. Any other ideas would be appreciated.
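For reference, the core of the transform can be run outside Shiny. The following is only a sketch of the same "keep the deepest paths" idea; the small `checkList` here is a hypothetical hand-built stand-in for what `get_selected()` returns, and it assumes node names never contain a '.':

```r
# Hypothetical stand-in for the slices returned by shinyTree
checkList <- list(
  list(Asia = list(China = list(Beijing = list(Round = list(`20383994` = 0))))),
  list(Asia = list(China = list(Beijing = list(Round = 0)))),
  list(Asia = list(China = list(Beijing = 0))),
  list(Asia = list(China = 0)),
  list(Asia = 0)
)

# unlist() joins nested names with '.', giving one path per slice
paths <- names(unlist(checkList))

# Depth of each path = number of '.'-separated parts
depth <- lengths(strsplit(paths, ".", fixed = TRUE))

# Keep the deepest paths and convert them to data.tree-style path strings
longest <- paths[depth == max(depth)]
pathString <- paste0("Root/", gsub(".", "/", longest, fixed = TRUE))
print(pathString)
```

Keeping only the maximum depth drops any complete branch that happens to be shorter than the deepest one, so this only fits trees where all selected leaves sit at the same level.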
So I've been trying to get a subset of a character vector for the last hour or so. In my (floundering) attempt to get this working I ran into an interesting characteristic of R. I have data (after JSON parsing) in the form of
[[1]]
[[1]]$business_id
[1] "rncjoVoEFUJGCUoC1JgnUA"
[[1]]$full_address
[1] "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345"
[[1]]$open
[1] TRUE
[[1]]$categories
[1] "Accountants" "Professional Services" "Tax Services"
[4] "Financial Services"
[[1]]$city
[1] "Peoria"
[[1]]$review_count
[1] 3
[[1]]$name
[1] "Peoria Income Tax Service"
[[1]]$neighborhoods
list()
[[1]]$longitude
[1] -112.2416
[[1]]$state
[1] "AZ"
[[1]]$stars
[1] 5
[[1]]$latitude
[1] 33.58187
[[1]]$type
[1] "business"
Here's the code I'm using
#!/usr/bin/Rscript
require(graphics)
require(RJSONIO)
parsed_data <- lapply(readLines("yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json"), fromJSON)
#parsed_data[,c("categories")]
print(parsed_data[1])
As I was trying to drop everything but the categories column I ran into this interesting behaviour
print(parsed_data[1])
print(parsed_data[1][1])
print(parsed_data[1][1][1][1][1][1])
All produce the same output (the one posted above). Why is that?
This is the difference between [ and [[. It is hard to search for these online, but ?'[' will bring up the help.
When indexing a list with [, a list is returned:
list(a=1:10, b=11:20)[1]
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
This is a list of one element, so repeating the operation again results in the same value:
list(a=1:10, b=11:20)[1][1]
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
[[ returns the element, not a list containing the element. It also only accepts a single index (whereas [ accepts a vector):
list(a=1:10, b=11:20)[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
And this operation is not idempotent on lists:
list(a=1:10, b=11:20)[[1]][[1]]
## [1] 1
Your JSON data is currently stored in a list, rather than a vector, so the indexing is different.
As Matthew has pointed out, there is a difference between using [] to access an element and using [[]]. For a discussion of this, I will refer you to this Stack Overflow thread:
In R, what is the difference between the [] and [[]] notations for accessing the elements of a list?
Looking at the printout, your data is stored as a nested list:
parsed_data[[1]]
This will give you a list containing each of the columns for the first record. To access the categories column, you can use any of the following:
parsed_data[[1]][["categories"]]
parsed_data[[1]][[4]]
parsed_data[[1]]$categories
This will give you a vector of names, as you'd expect:
## [1] "Accountants" "Professional Services" "Tax Services"
## [4] "Financial Services"
Note that when accessing by index (either named or numeric) you still have to use the double bracket notation: [[]]. If you use [] instead, it will give you a list instead of a vector:
parsed_data[[1]]["categories"]
## $categories
## [1] "Accountants" "Professional Services" "Tax Services"
## [4] "Financial Services"
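To get the categories for every business rather than just the first, the same `[[` extraction can be applied over the outer list. This is a minimal sketch; the two-record `parsed_data` below is a hypothetical stand-in for the parsed JSON, with only a few of the fields:

```r
# Hypothetical stand-in for the parsed JSON: a list of records
parsed_data <- list(
  list(business_id = "a1",
       categories = c("Accountants", "Tax Services"),
       city = "Peoria"),
  list(business_id = "b2",
       categories = c("Restaurants"),
       city = "Phoenix")
)

# `[[` is an ordinary function, so it can be passed to lapply();
# the result is a list with one character vector per business
categories <- lapply(parsed_data, `[[`, "categories")

# For fields holding exactly one string, vapply() returns a plain vector
cities <- vapply(parsed_data, function(b) b[["city"]], character(1))

print(categories)
print(cities)
```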
I have these two character strings, and the "as.numeric" function doesn't behave the same way for both of them. Can anyone help me understand why this is happening?
options(digits=22)
a="27"
as.numeric(a)
[1] 27.00000000000000000000
a="193381411288395777"
as.numeric(a)
[1] 193381411288395776.0000
It can be seen that in the second case the last digit is "6", not "7". Basically, the "as.numeric" function subtracts 1 from the number in the second case.
Any help is appreciated.
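You need to learn about the limits of exact number representation in floating point. A double has a 53-bit significand (see double.digits below), so integers above 2^53 cannot all be represented exactly. R can tell you what it has: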
R> .Machine
$double.eps
[1] 2.22045e-16
$double.neg.eps
[1] 1.11022e-16
$double.xmin
[1] 2.22507e-308
$double.xmax
[1] 1.79769e+308
$double.base
[1] 2
$double.digits
[1] 53
$double.rounding
[1] 5
$double.guard
[1] 0
$double.ulp.digits
[1] -52
$double.neg.ulp.digits
[1] -53
$double.exponent
[1] 11
$double.min.exp
[1] -1022
$double.max.exp
[1] 1024
$integer.max
[1] 2147483647
$sizeof.long
[1] 8
$sizeof.longlong
[1] 8
$sizeof.longdouble
[1] 16
$sizeof.pointer
[1] 8
R>
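To connect this to the number in the question, here is a small sketch of the arithmetic, assuming the usual IEEE 754 doubles that the `.Machine` output above describes. Above 2^53 the gap between neighbouring doubles is larger than 1, and at the magnitude of the value in question the gap works out to 32, so the string is rounded to the nearest representable value:

```r
# Largest range in which every integer is exactly representable
limit <- 2^.Machine$double.digits   # 2^53

# Adding 1 at the limit is lost to rounding
print(limit + 1 == limit)

# The value from the question, after conversion to double
x <- as.numeric("193381411288395777")

# Gap between neighbouring doubles at this magnitude:
# 2^(exponent - 52), where exponent = floor(log2(x))
spacing <- 2^(floor(log2(x)) - 52)
print(spacing)

# The stored value is a whole multiple of that gap
print(x %% spacing == 0)
```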
Use the int64 package:
library(int64)
> as.int64("193381411288395777")
[1] 193381411288395777
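If the int64 package cannot be installed (it may no longer be available from CRAN), the bit64 package offers a similar 64-bit integer type through `as.integer64()`. A hedged sketch, guarded so that it only runs when bit64 is installed:

```r
# Run only if the bit64 package is available
if (requireNamespace("bit64", quietly = TRUE)) {
  # Parse directly from the string so no double conversion happens first
  big <- bit64::as.integer64("193381411288395777")
  print(big)

  # Integer arithmetic stays exact within the 64-bit range
  print(big + 1L)
}
```

If the value is only an identifier and no arithmetic is needed, keeping it as a character string avoids the problem entirely.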