Weird conversion from list to dataframe in R

I have a list that I created from a for loop and it looks like this:
I tried to convert it to a dataframe using the code:
dflist<- as.data.frame(mylist)
But my dataframe looks like this now:
I know I probably created my list wrong but I am thinking this is still salvageable if I just need to convert the numbers to a dataframe correctly.
My end goal is to plot the numbers against their index (1-30) and I thought creating a dataframe first to clean it up and then plot would be helpful.
Any help would be really appreciated. Thank you.

The data shown is a list. We can unlist it and create a data.frame. Judging from the image in the OP's post, each list element has length 1, so unlist converts the list to a vector, which we then wrap in data.frame.
data.frame(ind= seq_along(lst), Col1= as.numeric(unlist(lst)))
Another option is to name the list elements and then use stack:
df1 <- transform(stack(setNames(lst, seq_along(lst))),
values = as.numeric(values))
This gives a two-column dataset, from which we can do the plotting.
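The plotting step itself is not shown above. A minimal sketch, assuming a hypothetical list lst of 30 length-one character elements (the OP's actual list is only visible in the missing image), could look like this:

```r
# Hypothetical stand-in for the OP's list: 30 elements, each of length 1
lst <- lapply(1:30, function(i) as.character(i^2))

# Flatten the list and pair every value with its index
df <- data.frame(ind = seq_along(lst), Col1 = as.numeric(unlist(lst)))

# Plot value against index; drawn only in an interactive session
if (interactive()) plot(Col1 ~ ind, data = df, type = "b")
```

The formula interface keeps the column names as axis labels, which is convenient for a quick look at the data.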
As for the OP's approach of calling as.data.frame directly on the list: it works differently because it dispatches to as.data.frame.list. For comparison, if we call as.data.frame on a vector, it uses as.data.frame.vector:
as.data.frame(1:5)
# 1:5
#1 1
#2 2
#3 3
#4 4
#5 5
But if we call as.data.frame.list directly on the same vector:
as.data.frame.list(1:5)
# X1L X2L X3L X4L X5L
#1 1 2 3 4 5
we get a data.frame with 'n' columns (based on the length of the vector).
Suppose we call as.data.frame on a list instead:
as.data.frame(as.list(1:5))
# X1L X2L X3L X4L X5L
#1 1 2 3 4 5
This uses as.data.frame.list. To get the complete list of methods for as.data.frame:
methods('as.data.frame')
#[1] as.data.frame.aovproj* as.data.frame.array
# [3] as.data.frame.AsIs as.data.frame.character
# [5] as.data.frame.chron* as.data.frame.complex
# [7] as.data.frame.data.frame as.data.frame.data.table*
# [9] as.data.frame.Date as.data.frame.dates*
#[11] as.data.frame.default as.data.frame.difftime
#[13] as.data.frame.factor as.data.frame.ftable*
#[15] as.data.frame.function* as.data.frame.grouped_df*
#[17] as.data.frame.idf* as.data.frame.integer
#[19] as.data.frame.ITime* as.data.frame.list <-------
#[21] as.data.frame.logical as.data.frame.logLik*
#[23] as.data.frame.matrix as.data.frame.model.matrix
#[25] as.data.frame.noquote as.data.frame.numeric
#[27] as.data.frame.numeric_version as.data.frame.ordered
#[29] as.data.frame.POSIXct as.data.frame.POSIXlt
#[31] as.data.frame.raw as.data.frame.rowwise_df*
#[33] as.data.frame.table as.data.frame.tbl_cube*
#[35] as.data.frame.tbl_df* as.data.frame.tbl_dt*
#[37] as.data.frame.tbl_sql* as.data.frame.times*
#[39] as.data.frame.ts as.data.frame.vector

Related

Rename row.name in data frame using matches or partial matches from a list

I have a data frame in R with 341 rows. I want to rename the row names using a list with 349 names. All 341 names will be in this list for sure. But not all of them will be perfect hits.
The data looks like this
rownames(df_RPM1)
[1] "LQNS02059392.1_11686_5p"
[2] "LQNS02277998.1_30984_3p"
[3] "LQNS02277998.1_30984_5p"
[4] "LQNS02277998.1_30988_3p"
[5] "LQNS02277998.1_30988_5p"
[6] "LQNS02277997.1_30943_3p"
[7] "miR-9|LQNS02278070.1_31740_3p"
[8] "miR-9|LQNS02278094.1_36129_3p"
head(inlist)
[1] "dpu-miR-2-03_LQNS02059392.1_11686_5p" "dpu-miR-10-P2_LQNS02277998.1_30984_3p"
[3] "dpu-miR-10-P2_LQNS02277998.1_30984_5p" "dpu-miR-10-P3_LQNS02277998.1_30988_3p"
[5] "dpu-miR-10-P3_LQNS02277998.1_30988_5p" "miR-9|LQNS02278070.1_31740_3p"
[6] "miR-9|LQNS02278094.1_36129_3p"
The order won't necessarily be the same in the two.
Can anyone suggest how to do this in R?
Thanks a lot
It depends a lot on what a "non-perfect hit" looks like. Assuming the row name is a substring of the real name, str_which() does the job quite well:
library(tidyverse)
real_names <- c("dpu-miR-2-03_LQNS02059392.1_11686_5p",
"dpu-miR-10-P2_LQNS02277998.1_30984_3p",
"dpu-miR-10-P2_LQNS02277998.1_30984_5p",
"dpu-miR-10-P3_LQNS02277998.1_30988_3p",
"dpu-miR-10-P3_LQNS02277998.1_30988_5p",
"miR-9|LQNS02278070.1_31740_3p",
"miR-9|LQNS02278094.1_36129_3p")
str_which(real_names, "LQNS02059392.1_11686_5p")
#> [1] 1
So we can vectorize (I removed element 6, which is not found in the example list):
pos <- map_int(rownames(df_RPM1), ~ str_which(real_names, fixed(.)))
pos
#> [1] 1 2 3 4 5 6 7
And all that's left is to change the row names:
rownames(df_RPM1) <- real_names[pos]
Of course, if a non-perfect hit means something more complicated, you may need to create a regex from the row names or something like that.
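Note that map_int() expects exactly one match per row name. If some row names may have no counterpart (or several), a base R variant along the following lines could be used instead. This is only a sketch: match_one is a hypothetical helper, and the data is a shortened copy of the example from the question.

```r
# Example data taken from the question (shortened)
real_names <- c("dpu-miR-2-03_LQNS02059392.1_11686_5p",
                "dpu-miR-10-P2_LQNS02277998.1_30984_3p",
                "miR-9|LQNS02278070.1_31740_3p")
short_names <- c("LQNS02059392.1_11686_5p",
                 "LQNS02277997.1_30943_3p",   # has no counterpart in real_names
                 "miR-9|LQNS02278070.1_31740_3p")

# Hypothetical helper: index of the unique literal match, NA otherwise
match_one <- function(s, candidates) {
  hits <- grep(s, candidates, fixed = TRUE)  # fixed = TRUE: "." and "|" are literal
  if (length(hits) == 1L) hits else NA_integer_
}
pos <- vapply(short_names, match_one, integer(1), candidates = real_names,
              USE.NAMES = FALSE)

# Keep the old name wherever no unique match was found
new_names <- ifelse(is.na(pos), short_names, real_names[pos])
```

Unmatched names are kept as they are, so they can be inspected afterwards with short_names[is.na(pos)].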

How to concatenate (merge) AAStringSets by name?

In bioinformatics/microbial ecology literature a fairly common practice is to concatenate multiple sequence alignments of multiple genes prior to building phylogenetic trees. In R terminology it may be clearer to say 'merge' these sequences by the organism they came from, but I'm sure examples are better.
Say these are two multiple sequence alignments.
library(Biostrings)
set1<-AAStringSet(c("IVR", "RDG", "LKS"))
names(set1)<-paste("org", 1:3, sep="_")
set2<-AAStringSet(c("VRT", "RKG", "AST"))
names(set2)<-paste("org", 2:4, sep="_")
set1
A AAStringSet instance of length 3
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
set2
A AAStringSet instance of length 3
width seq names
[1] 3 VRT org_2
[2] 3 RKG org_3
[3] 3 AST org_4
The correct concatenation of these sequences would be
A AAStringSet instance of length 4
width seq names
[1] 6 IVR--- org_1
[2] 6 RDGVRT org_2
[3] 6 LKSRKG org_3
[4] 6 ---AST org_4
The "-" denotes a 'gap' (lack of amino acid) in that position, or in this case a lack of a gene to concatenate.
I thought there would be a function to do this in Biostrings, MSA, DECIPHER, or other related packages, but have been unable to find one.
I found the following Q&As, but none of them produces the desired output described above.
1: https://support.bioconductor.org/p/38955/
output
A AAStringSet instance of length 6
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
[4] 3 VRT org_2
[5] 3 RKG org_3
[6] 3 AST org_4
May be better described as 'appending' the sequences (joins the two sets vertically).
2: https://support.bioconductor.org/p/39878/
output
A AAStringSet instance of length 2
width seq
[1] 9 IVRRDGLKS
[2] 9 VRTRKGAST
Concatenates sequences in each set, a complete chimera of each set (certainly not desired).
3: How to concatenate two DNAStringSet sequences per sample in R?
output
A AAStringSet instance of length 3
width seq
[1] 6 IVRVRT
[2] 6 RDGRKG
[3] 6 LKSAST
Creates chimeras of sequences by the order they are in. Even worse with different number of sequences (loops and concatenates shorter set...)
4: https://www.biostars.org/p/115192/
Output
A AAStringSet instance of length 2
width seq
[1] 3 IVR
[2] 3 VRT
Only appends the first sequence from each set, not sure why anyone wants this...
I would normally think these kinds of processes would be done with some combination of bash and Python, but I'm using the DECIPHER multiple sequence aligner in R, so it makes sense to do the rest of the processing in R. In the process of writing up this question I came up with an answer that I will post, but I'm kind of expecting someone to point me to the manual I missed that describes the function that does this. Thanks!
I am a somewhat fanatical user of data.table in R; among many other things, it is great for merging datasets by name. I found that a Biostrings AAStringSet can be converted to a matrix using as.matrix, which in turn can be converted to a data.table and merged.
library(data.table)
set1.dt<-data.table(as.matrix(set1), keep.rownames = TRUE)
set2.dt<-data.table(as.matrix(set2), keep.rownames = TRUE)
set12.dt<-merge(set1.dt, set2.dt, by="rn", all=TRUE)
set12.dt
rn V1.x V2.x V3.x V1.y V2.y V3.y
1: org_1 I V R <NA> <NA> <NA>
2: org_2 R D G V R T
3: org_3 L K S R K G
4: org_4 <NA> <NA> <NA> A S T
This is the correct merge, but it needs more work to get the final result.
First, NA needs to be replaced with "-". I always have to look up this question to remember the best way to do this with a data.table:
Fastest way to replace NAs in a large data.table
# adapted from the linked answer: added arg "x" (the replacement value)
f_dowle = function(dt, x) {
  # set() modifies dt by reference, replacing the NAs one column at a time
  for (j in names(dt))
    set(dt, which(is.na(dt[[j]])), j, x)
}
f_dowle(set12.dt, "-")
Concatenate the sequences (excluding the names column with !"rn")
set12<-apply(set12.dt[ ,!"rn"], 1, paste, collapse="")
Convert back to AAStringSet and add back names
set12<-AAStringSet(set12)
names(set12)<-set12.dt$rn
Desired output
set12
A AAStringSet instance of length 4
width seq names
[1] 6 IVR--- org_1
[2] 6 RDGVRT org_2
[3] 6 LKSRKG org_3
[4] 6 ---AST org_4
This works, but it seems quite cumbersome, especially the conversion between data formats. It can obviously be wrapped in a function for easier use, but again, it seems like this should already be a function in some Bioconductor package...
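For what it's worth, the same idea can be sketched in base R on named character vectors, without the detour through matrices. This is an illustration rather than an existing Bioconductor function: concat_by_name is a hypothetical helper, and it assumes that all sequences within one set have the same width. An AAStringSet could be passed in via as.character(set1) and the result turned back with AAStringSet().

```r
# Hypothetical helper: concatenate named character vectors by name,
# padding organisms missing from a set with gap characters.
concat_by_name <- function(..., gap = "-") {
  sets <- list(...)
  # All organism names, in order of first appearance
  all_names <- unique(unlist(lapply(sets, names)))
  parts <- lapply(sets, function(s) {
    width <- nchar(s[[1]])  # assumes equal width within a set
    out <- rep(strrep(gap, width), length(all_names))
    names(out) <- all_names
    out[names(s)] <- s      # overwrite the gaps where a sequence exists
    out
  })
  res <- do.call(paste0, parts)
  names(res) <- all_names
  res
}

s1 <- c(org_1 = "IVR", org_2 = "RDG", org_3 = "LKS")
s2 <- c(org_2 = "VRT", org_3 = "RKG", org_4 = "AST")
merged <- concat_by_name(s1, s2)
```

Because the sets are collected through ..., the sketch also accepts more than two alignments.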

Element wise concatenation of nested list [duplicate]

This question already has answers here:
Paste multiple columns together
(11 answers)
Closed 4 years ago.
I have a nested list
l1 <- letters
l2 <- 1:26
l3 <- LETTERS
list <- list(l1,l2,l3)
Is there an elegant way to concatenate the elements of the inner vectors position by position to form one character vector (possibly using paste)? The assumption is that all the inner vectors are of the same length.
I would like my final result to be
[1] "a1A"
[2] "b2B"
[3] "c3C"
[4] "d4D"
....
[26] "z26Z"
Try:
apply(sapply(list,paste0),1,paste0,collapse="")
[1] "a1A" "b2B" "c3C" "d4D" "e5E" "f6F" "g7G" "h8H" "i9I" "j10J" "k11K" "l12L" "m13M" "n14N" "o15O"
[16] "p16P" "q17Q" "r18R" "s19S" "t20T" "u21U" "v22V" "w23W" "x24X" "y25Y" "z26Z"
user20650's solution is probably as elegant as you are going to get. But for what it's worth, here's a quick hack in dplyr:
library(dplyr)
ll <- list(l1,l2,l3) # I try not to use "list" as a name. Gets confusing sometimes.
as.data.frame(ll) %>%
mutate(x = paste0(.[[1]], .[[2]], .[[3]])) %>%
.$x
# returns
[1] "a1A" "b2B" "c3C" "d4D" "e5E" "f6F" "g7G" "h8H" "i9I" "j10J" "k11K" "l12L"
[13] "m13M" "n14N" "o15O" "p16P" "q17Q" "r18R" "s19S" "t20T" "u21U" "v22V" "w23W" "x24X"
[25] "y25Y" "z26Z"
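Since paste0 is already vectorised over its arguments, another option is to hand the list elements to it directly with do.call. A minimal sketch using the vectors from the question (the names ll and res are just illustrative):

```r
l1 <- letters
l2 <- 1:26
l3 <- LETTERS
ll <- list(l1, l2, l3)

# Equivalent to paste0(l1, l2, l3): pastes the vectors element by element
res <- do.call(paste0, ll)
head(res, 4)
# [1] "a1A" "b2B" "c3C" "d4D"
```

This works for any number of inner vectors, as long as the list carries no names that clash with paste0's own arguments (such as collapse).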

Unknown format in R, how to convert it into rows and columns [duplicate]

This question already has answers here:
How to split item names when writing csv file of scraped data
(2 answers)
Closed 5 years ago.
I scraped something from the web that gives me something like this:
[1] "(Wirtschaft, 00:00)" "(Kultur, 23:42)" "(Sport, 23:38)" "(Politik, 23:16)"
[5] "(Sport, 22:29)" "(Panorama, 21:56)" "(Sport, 21:39)" "(Sport, 21:25)"
[9] "(Sport, 20:23)" "(Politik, 20:21)" "(Politik, 20:09)" "(Wissenschaft, 19:41)"
[13] "(Politik, 18:43)" "(Sport, 18:16)" "(Politik, 17:53)" "(Wirtschaft, 17:41)"
[17] "(Politik, 17:37)" "(Sport, 17:28)" "(Sport, 17:09)" "(Sport, 17:07)"
What I am wondering now is the following: how does R see this? I simply want to have observations (rows) and variables (columns). However, when I use ncol() or nrow(), it returns NULL. Can someone tell me how I can manipulate the data so that I have rows and columns? I know there is the separate function and all that, but everybody explains it in such a complicated way that you need 5 years of experience to understand it. Please help a beginner learn. Thanks
One solution could be with following steps:
# Data
v <- c("(Wirtschaft, 00:00)", "(Kultur, 23:42)", "(Sport, 23:38)","(Politik, 23:16)",
"(Sport, 22:29)","(Panorama, 21:56)","(Sport, 21:39)", "(Sport, 21:25)",
"(Sport, 20:23)","(Politik, 20:21)","(Politik, 20:09)",
"(Wissenschaft, 19:41)","(Politik, 18:43)")
# Solution
library(dplyr)
library(tidyr)
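# Strip the brackets, then turn the character vector into a one-column data frame
x <- gsub("\\(|\\)", "", v, perl = TRUE) %>% as.data.frame()
colnames(x) <- "Heading"
# Split at comma plus space so that Time has no leading blank
separate(x, "Heading", c("Item", "Time"), sep = ", ")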
Item Time
1 Wirtschaft 00:00
2 Kultur 23:42
3 Sport 23:38
4 Politik 23:16
5 Sport 22:29
6 Panorama 21:56
7 Sport 21:39
8 Sport 21:25
9 Sport 20:23
10 Politik 20:21
11 Politik 20:09
12 Wissenschaft 19:41
13 Politik 18:43
Here is a solution to transform the vector of strings you've shown into a data.frame, a structure with rows and columns:
# Your current vector
scraped <- c("(Wirtschaft, 00:00)", "(Kultur, 23:42)", "(Sport, 23:38)", "(Politik, 23:16)")
Here I've just recreated a sample of your data:
> scraped
[1] "(Wirtschaft, 00:00)" "(Kultur, 23:42)"
[3] "(Sport, 23:38)" "(Politik, 23:16)"
Now I'm creating a function that will remove the brackets from each element of this vector and split it at the comma:
# Create a function to clean each element of the vector
clean <- function(x) {
# Replace brackets with blank strings
no_brackets <- gsub("[()]", "", x)
# Split the string at the comma
split <- strsplit(no_brackets, ", ")[[1]]
return(split)
}
You can see how this works on a single element of your vector:
> clean(scraped[1])
[1] "Wirtschaft" "00:00"
It has taken "(Wirtschaft, 00:00)" and separated that one element into two, while removing the brackets and comma.
Next, I apply this function to every element of scraped using the function sapply:
# Apply the clean function to each element of your vector
mat <- sapply(scraped, clean)
Now we have a matrix:
> mat
(Wirtschaft, 00:00) (Kultur, 23:42) (Sport, 23:38) (Politik, 23:16)
[1,] "Wirtschaft" "Kultur" "Sport" "Politik"
[2,] "00:00" "23:42" "23:38" "23:16"
So this is now in a rows and columns format. However, it's more common to have variables of the same type in the same column and each observation in its own row, i.e. the other way up. It's also more useful to have them in the data structure called a data.frame rather than in a matrix. So in this final step, I will transpose the matrix with the t function and convert it to a data frame with the data.frame function:
# Transpose the matrix and convert it to a data.frame
df <- data.frame(t(mat), stringsAsFactors=FALSE)
Now the dataset is a data.frame that looks like this:
> df
X1 X2
(Wirtschaft, 00:00) Wirtschaft 00:00
(Kultur, 23:42) Kultur 23:42
(Sport, 23:38) Sport 23:38
(Politik, 23:16) Politik 23:16
You can access different values in the data.frame with the syntax df[row, column]:
> df[1, 1] # The first row and first column of df
[1] "Wirtschaft"
> df[3, 2] # The third row and second column of df
[1] "23:38"
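The same result can also be reached without a custom function, because gsub and strsplit are both vectorised. The following is only a sketch under the same assumptions as above (every element has the form "(Item, Time)"); the column names Item and Time are chosen for illustration.

```r
scraped <- c("(Wirtschaft, 00:00)", "(Kultur, 23:42)", "(Sport, 23:38)", "(Politik, 23:16)")

# Remove the brackets, then split every element at ", " in one go
parts <- strsplit(gsub("[()]", "", scraped), ", ", fixed = TRUE)

# Pick the first and second piece of each element as the two columns
df2 <- data.frame(Item = vapply(parts, `[`, character(1), 1),
                  Time = vapply(parts, `[`, character(1), 2),
                  stringsAsFactors = FALSE)
```

Here df2 has one row per scraped string and plain row numbers instead of the original strings as row names.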

Saving dates in a matrix ("origin must be supplied") with r

I am writing my bachelor thesis and I have not much experience with r so far.
My problem is that my dates, which I created with this command:
t<-strptime(x, "%d.%m.%Y %H.%M")
don't work anymore when I save them in a matrix with the other information on those specific dates.
I am a bit confused because it works just fine when I don't put them in a matrix, like this: t[1:10]
But this happens as soon as I try to save them in a matrix:
matrix1<-matrix(c(t,v2,v3,v4),nrow=length(v2))
Fehler in as.POSIXct.numeric(X[[i]], ...) : 'origin' muss angegeben werden
It's German, but it means "'origin' must be supplied".
Any ideas what I have to do to fix it? I am a bit frustrated :)
Roland is right: you can't have POSIXlt objects in a matrix. What you can do is save the dates as numeric timestamps in the matrix and convert them back to dates when accessing them.
Converting to a numeric timestamp (the time zone is passed via tz; a "UTC" suffix inside the string is not parsed):
> date <- as.numeric(as.POSIXct("2014-02-16 02:13:46", tz = "UTC"))
> date
[1] 1392516826
Then save those timestamps in the matrix as you do now. To convert a timestamp back to a date, call as.POSIXct on the number with origin = "1970-01-01" and the same tz.
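A round trip through a numeric matrix might look like the following sketch. The vector v2 is a hypothetical measurement column standing in for the OP's other data, and the time zone UTC is an assumption.

```r
# Two example time stamps, parsed explicitly as UTC
times <- as.POSIXct(c("2014-02-16 02:13:46", "2014-02-16 03:00:00"), tz = "UTC")
v2 <- c(1.5, 2.5)  # hypothetical measurements belonging to those dates

# A matrix holds one atomic type, so the dates go in as seconds since 1970-01-01
m <- cbind(time = as.numeric(times), v2 = v2)

# Convert the stored numbers back to date-times when reading them out
back <- as.POSIXct(m[, "time"], origin = "1970-01-01", tz = "UTC")
```

The origin argument is exactly what the error message in the question asks for: a bare number carries no information about where counting starts.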
t (terrible name by the way, easily confused with the t function) is a POSIXlt object, which internally is a list. First, you should check what c(t,v2,v3,v4) returns (I don't know how v2 etc. are defined).
Then we can look into the documentation in help("matrix"):
data
an optional data vector (including a list or expression vector). Non-atomic classed R objects are coerced by as.vector and all attributes discarded.
The important bit is "all attributes discarded". This is what you get if you discard the attributes (which include the class attribute) of a POSIXlt object:
x <- strptime(c("2016-05-09 12:00:00", "2016-05-09 13:00:00"), format = "%Y-%m-%d %H:%M:%S")
attributes(x) <- NULL
print(x)
# [[1]]
# [1] 0 0
#
# [[2]]
# [1] 0 0
#
# [[3]]
# [1] 12 13
#
# [[4]]
# [1] 9 9
#
# [[5]]
# [1] 4 4
#
# [[6]]
# [1] 116 116
#
# [[7]]
# [1] 1 1
#
# [[8]]
# [1] 129 129
#
# [[9]]
# [1] 1 1
#
# [[10]]
# [1] "CEST" "CEST"
#
# [[11]]
# [1] NA NA
A matrix can't contain POSIXlt objects (or any objects, i.e., anything with an explicit class).
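If the goal is simply to keep the dates next to the other variables, a data.frame is a possible alternative, since its columns may have different classes. A minimal sketch, in which v2 is a hypothetical stand-in for the OP's other vectors and the time zone UTC is an assumption:

```r
# Parse dates in the format used in the question
x <- strptime(c("09.05.2016 12.00", "09.05.2016 13.00"), "%d.%m.%Y %H.%M", tz = "UTC")
v2 <- c(10, 20)  # hypothetical values belonging to those dates

# Store the dates as POSIXct (a classed numeric vector) in a data.frame column
dat <- data.frame(time = as.POSIXct(x), v2 = v2)
class(dat$time)
```

Unlike in a matrix, the column keeps its class, so the dates still print and compare as dates.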

Resources