How to fix a large dataframe in R

It should be a simple fix, but I don't have a very deep understanding of how R handles data.
I have a dataframe that is the result of importing two columns from a .xlsx file. I'm trying to use the anytime package to convert a Unix timestamp to an R-friendly date. I had no problem with a previous dataframe, and from what I can see this new one has the same structure.
Here is the dput from each dataframe:
> dput(head(test3,10))
structure(list(city_name = c(NA, NA, "Northampton", NA, "Parkville",
"San Jose", "San Jose", NA, "Parkville", "Northampton"), dateline = c(1281496979,
1313188858, 1313188895, 1313188913, 1313188938, 1313188957, 1313188987,
1313189030, 1313189067, 1313189204)), row.names = 87:96, class = "data.frame")
> dput(head(user,10))
structure(list(userid = c(1, 1, 1, 3, 5, 4, 6, 4, 3, 5), dateline = c(1281496979,
1281496979, 1281496990, 1281507443, 1281508294, 1281508362, 1281508399,
1281508589, 1281508603, 1281508629)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
The user dataframe is the broken one. When I try to run anytime, I get this error:
Error in anytime_cpp(x, tz = tz, asUTC = asUTC, useR = useR, oldHeuristic = oldHeuristic) :
Unsupported Type
Through my own troubleshooting, I figured out that if I remove "tbl_df" and "tbl" from the class = argument, the user dataframe works with anytime. However, the dataframe is about 900,000 rows long, so I can't solve this by editing dput output. How can I fix the structure of my user dataframe?

Related

How do I find common characters in a list of dataframes?

I have about 70 dataframes in a list, each of them has a column named SNP. I want to find the common SNPs that exist in all dataframes. This is the code I used:
setwd("~")
library(data.table)
files <- list.files()
dflist <- list()
for(i in 1:length(files)){
dflist[[i]] <- fread(files[i])
}
map(dflist, ~.$SNP) %>%
reduce(intersect)
However, this returns:
character(0)
Here is a dput of the first few rows of three of the data.tables in dflist:
list(structure(list(`10:103391446` = c("10:115562764:TTTC_",
"10:115562765:TTC_T", "10:14188623_CCTGA_C", "10:15988900:G_GGT"
)), row.names = c(NA, -4L), class = c("data.table", "data.frame"
)), structure(list(SNP = c("rs34394051",
"rs11121177", "rs10799615", "rs590013")), row.names = c(NA, -4L
), class = c("data.table", "data.frame")),
structure(list(SNP = c("rs34394051", "rs11121177", "rs10799615",
"rs590013")), row.names = c(NA, -4L), class = c("data.table",
"data.frame")))
Can you help please?
Your problems appear to be two-fold:
1. One of your frames is missing SNP as a column name. That will often cause problems:
setdiff(mtcars$QUUX, mtcars$cyl)
# NULL
This is not hard to fix (names(dflist[[1]]) <- "SNP"), but does not resolve all of the problems.
2. Your first frame has completely different-looking data. When I skip the first frame, it works:
map(dflist[-1], ~.$SNP) %>%
reduce(intersect)
# [1] "rs34394051" "rs11121177" "rs10799615" "rs590013"

How to build a new list based on another file and sort it in a certain way

I have a named list, datalist, that contains data from multiple files.
Now I have a df that looks like this:
structure(list(Order = c(1, 2, 3, 4), Data = c("Bone Scan", "Brain Scan",
"", "Cancer History")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
How can I build a new data list which contains only the data that is in df$Data, stored in the order in which it appears in df?
Try subsetting datalist with df$Data; it should return the elements in the same order as df$Data.
result <- datalist[df$Data]
We can also use pluck
library(purrr)
datalist %>%
pluck(df$Data)
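For illustration, a made-up datalist (hypothetical values, standing in for the real list) shows that subsetting by name follows the order of df$Data; note that the empty string in df$Data has no match and would give a NULL element, so it is dropped first:
# hypothetical stand-in for the real datalist, keyed by the labels in df$Data
datalist <- list(
  "Cancer History" = data.frame(x = 1),
  "Bone Scan"      = data.frame(x = 2),
  "Brain Scan"     = data.frame(x = 3)
)
# drop empty labels, then subset by name: the result follows the order of df$Data
result <- datalist[df$Data[df$Data != ""]]
names(result)
# [1] "Bone Scan"      "Brain Scan"     "Cancer History"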

row.name using `structure` function as c(NA, *integer*)

Does anyone know why when I run this:
row.names(structure(list(speed = c(4, 7), dist = c(2, 22)),
row.names = c(NA, 2L), class = "data.frame"))
I get this:
# "1" "2"
and not c(NA, 2L)? In other words, what exactly does the row.names argument in structure do with the value it is given?
I came across this when I tried to use dput to see the structure of some dataframes. e.g.
dput(cars)
And I noticed the row.names argument in it, which is c(NA, -50L).
c(NA, n) (which dput writes as c(NA, -n)) is how data frames internally store the row names in the common case where they are simply 1:n, to save space and processing time. This form is not supposed to be visible to the user, who is meant to see "1", "2", ..., so the accessor functions translate it.
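As a quick check, base R's .row_names_info() exposes the internal representation; a minimal sketch with the built-in cars data set:
.row_names_info(cars, type = 0L)  # the attribute as actually stored
# [1]  NA -50
.row_names_info(cars, type = 2L)  # the number of rows it implies
# [1] 50
head(row.names(cars), 3)          # what the user-facing accessor returns
# [1] "1" "2" "3"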

fuzzy matching in DNA seqs

For the purposes of the reprex I've generated a tibble called random_DNA_tbl that is a random selection of 10 DNA sequences (of 100 bases each). I've got a separate tibble called subseq_tbl, with 3 shorter sequences that match 100% to 3 of the sequences in random_DNA_tbl, but I'd also like to use fuzzy matching of sequences from subseq_tbl to other sequences in random_DNA_tbl. I was hoping to use the fuzzyjoin package's stringdist_XX_join functions; however, these don't seem to work, even though the subseq sequences are perfect matches and do work with other matching functions, e.g. regex_XX_join.
library(tidyverse)
library(fuzzyjoin)
random_DNA_tbl <- structure(list(random_name = c("random_seq_1", "random_seq_2",
"random_seq_3", "random_seq_4", "random_seq_5", "random_seq_6",
"random_seq_7", "random_seq_8", "random_seq_9", "random_seq_10"
), random_seq = c("CTCCAGTATTAGTCAATGATAAGGGCGAAGGAGCAGTTCTGATATCTCTGTGAAGTAGCATGCGTCTGACTCTCGGGCGCGGCGGAAGACCGAGGAGCGC",
"TTTTCGTCCGACAGAACATCATATAAACTCGATTTAATCTTCTTTTCAAAATCAATTCGAGGGCACCCGATGCGCGTACTGTCAACCATCAAGATAACGA",
"GAATAGTGTACCAGGTCTTATAGTATGTTCATTCGTACAAAAGGATCCAAAACCAATAGGAACCGCTTCTCCCAACAAGCCTGCTCCTTGCAGAGTGAGT",
"GTGACGCCAGATTCTTGACCTGAACCCAGTTCTACCCCCCCAAAACGATCTGGCTTCCGCTCTCTAATGACAGCTATATTGCTTGATAGAGATCGGTAGG",
"ACCGCCTTCCGTAGGTGAACAACCAGCCTCCTGCGGCCAGGGAAGAAGTCGTGGCCTTGGTTAATTTTGGGTTACTAAACGGACACCCACCGTGGCTCAC",
"ACGACTATCAAGACAACTTGTCTCAGAGCTTCACGCACCAACCCCTAACCCAGCAACTCCAGGGCATTGCCACTCTATGATTCGGCGCGGGTGCGCCCTC",
"GGTAGCACTGAGATCAGCCACTATCAAGGTGCTCCTCACTTCTGGTTCTCAGGTTGCGGGCCGATCATTTTTCTCCGAATTAGCGGTCTTTCACGTCAGA",
"CACTGAATAGTCAGCGTAAAGGCGTCAATCTGTCAGCTCGACGGCAGAAGATGTCCAGCGTGCAGTTTCATAGGCGCCCCGGGGAACCTTCTGTGAGAAT",
"GCCTCTTAATTCTTGAACCGCGAGAGGACACAGTGAGATCTGTTCCATTTCCCCCGTTGCCCGCATGGATCGCCCAGACTCTAGACTTAGTGTGACCTTT",
"CGGTATCGGATTGGTCTACGAATCCGCGACCCTCAAGGTTATTTCTGGATGGAGTTCCGTGCTCGCCTGGATGCACTGCCCAAGCAATTAGGACGAAGTA"
)), .Names = c("random_name", "random_seq"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
subseq_tbl <- structure(list(subseq_name = c("subseq1", "subseq2", "subseq3"
), subseq = c("TCAACCATCAAGATAAC", "TAGCGGTCTTTCACGT", "AAGGATCC"
)), .Names = c("subseq_name", "subseq"), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
Doesn't work:
stringdist_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
Does work:
regex_left_join(random_DNA_tbl, subseq_tbl, by = c(random_seq = "subseq"))
I've tried tweaking the max_dist parameter in stringdist_left_join, but to no avail. Can anyone shed any light on the problem, please?
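One likely explanation, offered as a guess rather than a confirmed answer: stringdist_left_join compares the two strings in their entirety, so the edit distance between a 16-base subsequence and a 100-base sequence that contains it is still about 84 (the difference in length), which no sensible max_dist will cover, whereas regex_left_join looks for the pattern anywhere inside the longer string. A quick sketch with the stringdist package shows the scale of the distances involved; if approximate substring matching is the goal, a tool that searches within sequences (for example Biostrings::vmatchPattern() with a max.mismatch allowance) may be a better fit than whole-string distances.
library(stringdist)
# subseq2 is a perfect substring of random_seq_7 (see the dput above), yet the
# whole-string Levenshtein distance is still the difference in length
stringdist(subseq_tbl$subseq[2], random_DNA_tbl$random_seq[7], method = "lv")
# [1] 84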

R convert dataframe to JSON

I have a dataframe, res1, that I'd like to convert to JSON format:
library(rjson)
structure(list(id = c(1, 2, 3, 4, 5), value = structure(1:5, .Label = c("server1",
"server2", "server3", "server4", "server5"), class = "factor")), .Names = c("id",
"value"), row.names = c(NA, -5L), class = "data.frame")
when I do:
toJSON(res1)
I get this:
{"id":[1,2,3,4,5],"value":["server1","server2","server3","server4","server5"]}
I need the JSON output to look like this instead; any ideas?
[{"id":1,"value":"server1"},{"id":2,"value":"server2"},{"id":3,"value":"server3"},{"id":4,"value":"server4"},{"id":5,"value":"server5"}]
The jsonlite package exists to address exactly this problem: "A practical and consistent mapping between JSON data and R objects."
Its toJSON function provides this desired result with the default options:
library(jsonlite)
x <- toJSON(res1)
cat(x)
## [{"id":1,"value":"server1"},{"id":2,"value":"server2"},
## {"id":3,"value":"server3"},{"id":4,"value":"server4"},
## {"id":5,"value":"server5"}]
How about
library(rjson)
x <- toJSON(unname(split(res1, 1:nrow(res1))))
cat(x)
# [{"id":1,"value":"server1"},{"id":2,"value":"server2"},
# {"id":3,"value":"server3"},{"id":4,"value":"server4"},
# {"id":5,"value":"server5"}]
By using split() we are essentially breaking up the large data.frame into a separate data.frame for each row. And by removing the names from the resulting list, the toJSON function wraps the results in an array rather than a named object.
Now you can easily just call jsonlite::write_json() directly on the dataframe.
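For example (res1.json is just a hypothetical output path):
library(jsonlite)
# writes the same row-wise JSON array shown above straight to a file
write_json(res1, "res1.json")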
You can also use library(jsonify)
jsonify::to_json( res1 )
# [{"id":1.0,"value":"server1"},{"id":2.0,"value":"server2"},{"id":3.0,"value":"server3"},{"id":4.0,"value":"server4"},{"id":5.0,"value":"server5"}]
