row.name using `structure` function as c(NA, *integer*) - r

Does anyone know why when I run this:
row.names(structure(list(speed = c(4, 7), dist = c(2, 22)),
row.names = c(NA, 2L), class = "data.frame"))
I get this:
# "1" "2"
and not c(NA, 2L)? What exactly does structure do with its row.names argument?
I came across this when I used dput to inspect the structure of some data frames, e.g.
dput(cars)
I noticed the row.names argument in its output, which is c(NA, -50L).

c(NA, n) is how data frames internally store the row names in the common case of 1:n, to save space and processing time. This compact form is not meant to be visible to the user, who is supposed to see the row names as "1", "2", ..., so the accessor functions translate it.
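The translation can be observed with a small base R sketch. It reuses the data frame from the question; the names df, internal and n_rows are only for illustration, and .row_names_info() is the base R helper that exposes the stored attribute.

```r
# Build the data frame from the question with compact row names.
df <- structure(list(speed = c(4, 7), dist = c(2, 22)),
                row.names = c(NA, 2L), class = "data.frame")

# The accessor functions expand the compact form.
row.names(df)          # "1" "2"
attr(df, "row.names")  # the integer sequence 1, 2

# type = 0L returns the attribute as it is stored internally;
# type = 2L returns the number of rows that attribute implies.
internal <- .row_names_info(df, 0L)
n_rows   <- .row_names_info(df, 2L)
```

Printing internal shows the two-element compact form rather than the expanded names, which is also what dput exposes.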

Related

How I get names into the specific strings

I have the following vector:
a <- c("teste3/Nova pasta3/texto33.txt", "teste3/texto3.txt", "teste3/Nova pasta3",
"teste3")
In other cases I have a data frame rather than a vector:
structure(list(filename = c("teste1/", "teste1/Nova pasta1/",
"teste1/Nova pasta1/texto11.txt", "teste1/texto1.txt", "teste1/New Folder/"
)), class = "data.frame", row.names = c(NA, -5L))
I would like to get the names that sit between two slashes (/*/).
In this case just the name (Nova pasta3) for the vector and the name (Nova pasta1) for the dataframe.
Thanks

How do I use the list name as part of column name using tidyr?

I have a nested JSON file with a pretty simple structure. The list name is flavor, and there is a nested df one level below. One of the columns is nested further. How can I use the name of the list, "flavor", as a prefix to the column names when I unnest? I would be looking for column names like flavor.id, flavor.name, etc.
I don't have a great reprex, but I'd be looking to use some form of tidyr or purrr. I tried purrr::flatten() to no avail.
Sample reprex:
sample <- list(
  flavor = structure(
    list(
      nested_col = list(
        structure(list(column = 0L, id = "B30D41F4-5684-11E1-8E9A-8F09EE5110CB"),
                  class = "data.frame", row.names = 1L),
        structure(list(id = "B30B5B28-5684-11E1-8E9A-8F09EE5110CB", column = 0L),
                  class = "data.frame", row.names = 1L)
      ),
      short_name = c("Bi", "Br"),
      abbr = c("RR", "CHOC"),
      long_abbr = c("BXB", "BK"),
      id = c("13", "11"),
      name = c("Rock n Road", "Chocolate")
    ),
    class = "data.frame", row.names = c(NA, 2L)
  )
)
I would be looking to extract the list into a tbl_df. The corresponding columns would look something like flavor_nested_col, flavor_short_name, etc.

How to fix a large dataframe in R

It should be a simple solution, but I don't have a very deep understanding of how R handles data.
I have a dataframe that is the result of importing two columns from a .xlsx file. I'm trying to use the anytime package to convert a Unix timestamp to an R-friendly date. I had no problem with a previous dataframe, and from what I can see this new one has the same structure.
Here is the dput from each dataframe:
> dput(head(test3,10))
structure(list(city_name = c(NA, NA, "Northampton", NA, "Parkville",
"San Jose", "San Jose", NA, "Parkville", "Northampton"), dateline = c(1281496979,
1313188858, 1313188895, 1313188913, 1313188938, 1313188957, 1313188987,
1313189030, 1313189067, 1313189204)), row.names = 87:96, class = "data.frame")
> dput(head(user,10))
structure(list(userid = c(1, 1, 1, 3, 5, 4, 6, 4, 3, 5), dateline = c(1281496979,
1281496979, 1281496990, 1281507443, 1281508294, 1281508362, 1281508399,
1281508589, 1281508603, 1281508629)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
The user dataframe is the broken one. When I try to run anytime, I get this error:
Error in anytime_cpp(x, tz = tz, asUTC = asUTC, useR = useR, oldHeuristic = oldHeuristic) :
Unsupported Type
Through my own troubleshooting, I figured out that when I remove "tbl_df" and "tbl" from class = in the dput output, the user dataframe works properly with anytime. However, the dataframe is about 900,000 rows long, so I can't solve this by editing dput output. How can I fix the structure of my user dataframe?
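One way to drop the extra classes without going through dput, sketched in base R. It assumes the only difference that matters is the "tbl_df" and "tbl" classes, as observed above; the three-row user object is a stand-in built from the dput output, and user_df is a name chosen for illustration.

```r
# Stand-in for the real data: same columns and classes, only three rows.
user <- structure(list(userid = c(1, 1, 1),
                       dateline = c(1281496979, 1281496979, 1281496990)),
                  row.names = c(NA, -3L),
                  class = c("tbl_df", "tbl", "data.frame"))

# as.data.frame() returns a plain data frame regardless of the number of rows.
user_df <- as.data.frame(user)
class(user_df)  # "data.frame"
```

The columns and values are unchanged; only the class attribute differs, which is the same change that was made by hand in the dput output.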

fuzzy matching in DNA seqs

For the purposes of the reprex I've generated a tibble called random_DNA_tbl that holds a random selection of 10 DNA sequences (of 100 bases each). I've got a separate tibble called subseq_tbl with 3 shorter sequences that match 100% to parts of 3 of the sequences in random_DNA_tbl, but I'd also like to fuzzy match sequences from subseq_tbl against the other sequences in random_DNA_tbl. I was hoping to use the stringdist_XX_join functions from the fuzzyjoin package. However, these don't seem to work, even though the subseq sequences are actually perfect matches and do work with other matching functions, e.g. regex_XX_join.
library(tidyverse)
library(fuzzyjoin)
random_DNA_tbl <- structure(list(random_name = c("random_seq_1", "random_seq_2",
"random_seq_3", "random_seq_4", "random_seq_5", "random_seq_6",
"random_seq_7", "random_seq_8", "random_seq_9", "random_seq_10"
), random_seq = c("CTCCAGTATTAGTCAATGATAAGGGCGAAGGAGCAGTTCTGATATCTCTGTGAAGTAGCATGCGTCTGACTCTCGGGCGCGGCGGAAGACCGAGGAGCGC",
"TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA",
"GAATAGTGTACCAGGTCTTATAGTATGTTCATTCGTACAAAAGGATCCAAAACCAATAGGAACCGCTTCTCCCAACAAGCCTGCTCCTTGCAGAGTGAGT",
"GTGACGCCAGATTCTTGACCTGAACCCAGTTCTACCCCCCCAAAACGATCTGGCTTCCGCTCTCTAATGACAGCTATATTGCTTGATAGAGATCGGTAGG",
"ACCGCCTTCCGTAGGTGAACAACCAGCCTCCTGCGGCCAGGGAAGAAGTCGTGGCCTTGGTTAATTTTGGGTTACTAAACGGACACCCACCGTGGCTCAC",
"ACGACTATCAAGACAACTTGTCTCAGAGCTTCACGCACCAACCCCTAACCCAGCAACTCCAGGGCATTGCCACTCTATGATTCGGCGCGGGTGCGCCCTC",
"GGTAGCACTGAGATCAGCCACTATCAAGGTGCTCCTCACTTCTGGTTCTCAGGTTGCGGGCCGATCATTTTTCTCCGAATTAGCGGTCTTTCACGTCAGA",
"CACTGAATAGTCAGCGTAAAGGCGTCAATCTGTCAGCTCGACGGCAGAAGATGTCCAGCGTGCAGTTTCATAGGCGCCCCGGGGAACCTTCTGTGAGAAT",
"GCCTCTTAATTCTTGAACCGCGAGAGGACACAGTGAGATCTGTTCCATTTCCCCCGTTGCCCGCATGGATCGCCCAGACTCTAGACTTAGTGTGACCTTT",
"CGGTATCGGATTGGTCTACGAATCCGCGACCCTCAAGGTTATTTCTGGATGGAGTTCCGTGCTCGCCTGGATGCACTGCCCAAGCAATTAGGACGAAGTA"
)), .Names = c("random_name", "random_seq"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
subseq_tbl <- structure(list(subseq_name = c("subseq1", "subseq2", "subseq3"
), subseq = c("TCAACCATCAAGATAAC", "TAGCGGTCTTTCACGT", "AAGGATCC"
)), .Names = c("subseq_name", "subseq"), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
Doesn't work:
stringdist_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
Does work:
regex_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
I've tried tweaking the max_dist parameter in stringdist but to no avail. Can anyone shed any light on the problem please?

R convert dataframe to JSON

I have a dataframe that I'd like to convert to json format:
My data frame is called res1:
library(rjson)
res1 <- structure(list(id = c(1, 2, 3, 4, 5), value = structure(1:5, .Label = c("server1",
"server2", "server3", "server4", "server5"), class = "factor")), .Names = c("id",
"value"), row.names = c(NA, -5L), class = "data.frame")
When I do:
toJSON(res1)
I get this:
{"id":[1,2,3,4,5],"value":["server1","server2","server3","server4","server5"]}
I need this json output to be like this, any ideas?
[{"id":1,"value":"server1"},{"id":2,"value":"server2"},{"id":3,"value":"server3"},{"id":4,"value":"server4"},{"id":5,"value":"server5"}]

The jsonlite package exists to address exactly this problem: "A practical and consistent mapping between JSON data and R objects."
Its toJSON function produces the desired result with the default options:
library(jsonlite)
x <- toJSON(res1)
cat(x)
## [{"id":1,"value":"server1"},{"id":2,"value":"server2"},
## {"id":3,"value":"server3"},{"id":4,"value":"server4"},
## {"id":5,"value":"server5"}]

How about this, staying with rjson:
library(rjson)
x <- toJSON(unname(split(res1, 1:nrow(res1))))
cat(x)
# [{"id":1,"value":"server1"},{"id":2,"value":"server2"},
# {"id":3,"value":"server3"},{"id":4,"value":"server4"},
# {"id":5,"value":"server5"}]
By using split() we are essentially breaking up the large data.frame into a separate data.frame for each row. And by removing the names from the resulting list, the toJSON function wraps the results in an array rather than a named object.
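What split() and unname() produce before the result reaches toJSON can be checked with a short base R sketch. The three-row res1 below is a cut-down stand-in with a character value column, and rows and rows_unnamed are names chosen for illustration.

```r
res1 <- data.frame(id = c(1, 2, 3),
                   value = c("server1", "server2", "server3"),
                   stringsAsFactors = FALSE)

# One group per row: a list of one-row data frames, named after the groups.
rows <- split(res1, 1:nrow(res1))
names(rows)        # "1" "2" "3"
rows[["2"]]$value  # "server2"

# unname() drops the list names, leaving a plain unnamed list.
rows_unnamed <- unname(rows)
is.null(names(rows_unnamed))  # TRUE
```

An unnamed list is what gets serialized as a JSON array, whereas the named list from split() alone would become an object keyed by "1", "2", "3".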
Nowadays you can simply call jsonlite::write_json() directly on the data frame to write it to a file.

You can also use the jsonify package:
jsonify::to_json( res1 )
# [{"id":1.0,"value":"server1"},{"id":2.0,"value":"server2"},{"id":3.0,"value":"server3"},{"id":4.0,"value":"server4"},{"id":5.0,"value":"server5"}]
