Converting comma separated list to dataframe - r

If I have a list similar to x <- c("Name,Age,Gender", "Rob,21,M", "Matt,30,M"), how can I convert to a dataframe where Name, Age, and Gender become the column headers.
Currently my approach is,
dataframe <- data.frame(matrix(unlist(x), nrow=3, byrow=T))
which gives me
matrix.unlist.user_data...nrow...num_rows..byrow...T.
1 Name,Age,Gender
2 Rob,21,M
3 Matt,30,M
and doesn't help me at all.
How can I get something which resembles the following from the list mentioned above?
+---------------------------------------------+
| name | age | gender |
| | | |
+---------------------------------------------+
| | | |
| | | |
| ... | ... | ... |
| | | |
| | | ++
+---------------------------------------------+
| | | |
| ... | ... | ... |
| | | |
| | | |
+---------------------------------------------+

We paste the strings into a single string with \n and use either read.csv or read.table from base R
read.table(text=paste(x, collapse='\n'), header = TRUE, stringsAsFactors = FALSE, sep=',')

Alternatively,
data.table::fread(paste(x, collapse = "\n"))
Name Age Gender
1: Rob 21 M
2: Matt 30 M

Related

R, get entires from one column based on values of another column in R

I am trying to get column entries as a list that match a list of entries from data frame
Showing what I am trying to do:
Dataframe named Tepo
| | name | shortcut |
| -------- | -------------- | ----------|
| 1 | Apples | A |
| 2 | Bannans | B |
| 3 | oranges | O |
| 4 | Carrots | C |
| 5 | Mangos | M |
| 6 | Strawberies | S |
I have a list FruitList as chr
>FruitList
>[1] "Bannas" "Carrots" "Mangos"
And I would like to get a list, shortcutList, of the corresponding columns:
>shortcutList
>[1] "B" "C" "M"
My attempt:
shortcutList <- tepo$shorcut[tepo$name == FruiteList[]]
However, I don't get the desired list output.
Thanks for the help
Use %in% :
shortcutList <- tepo$shortcut[tepo$name %in% FruitList]

Relabel of rowname column in R dataframe

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
| V1 | V2 | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+
We can use parse_number to extract the first occurence of numbers
library(dplyr)
df1 %>%
mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))

For each combination of a set of variables in a list, calculating correlations between this combination and another variable in R

In R I want to generate correlation co-efficients by comparing 2 variables whilst also retaining a phylogenetic signal.
The initial way I thought to do this is not computationally efficient, and I think there is a much simpler, but I do not have the skills in R to do it.
I have a csv file which looks like this:
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
| Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159 |
| Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984 |
| Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289 |
| Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987 |
| Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148 |
+-------------------------------+-----+----------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+
What I want to do is, for each possible combination of the percentages within the 20 single letter columns (amino acids, so 10 million combinations). Is to calculate the correlation between each different combination and the OGT variable in the CSV.... (whilst retaining a phylogenetic signal)
My current code is this:
library(parallel)
library(dplyr)
library(tidyr)
library(magrittr)
library(ape)
library(geiger)
library(caper)
taxonomynex <- read.nexus("taxonomyforzeldospecies.nex")
zeldodata <- read.csv("COMPLETECOPYFORR.csv")
Species <- dput(zeldodata)
SpeciesLong <-
Species %>%
gather(protein, proportion,
A:Y) %>%
arrange(Species)
S <- unique(SpeciesLong$protein)
Scombi <- unlist(lapply(seq_along(S),
function(x) combn(S, x, FUN = paste0, collapse = "")))
joint_protein <- function(protein_combo, data){
sum(data$proportion[vapply(data$protein,
grepl,
logical(1),
protein_combo)])
}
SplitSpecies <-
split(SpeciesLong,
SpeciesLong$Species)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, c("Scombi", "joint_protein"))
SpeciesAggregate <-
parLapply(cl,
X = SplitSpecies,
fun = function(data){
X <- lapply(Scombi,
joint_protein,
data)
names(X) <- Scombi
as.data.frame(X)
})
Species <- cbind(Species, SpeciesAggregate)
`
Which attempts to feed in each combination into memory and then calculate the sum of each proportion of each of the acids, but this takes forever to finish and crashes before completion.
I think it would be better to feed in correlation co-efficents into a vector, and then just print out the relative co-efficients of each different combination for each species, but I don't know the best way of doing this in R.
I also aim to retain a phylogenetic signal using the ape package using something along the lines of this:
pglsModel <- gls(OGT ~ AminoAcidCombination, correlation = corBrownian(phy = taxonomynex),
data = zeldodata, method = "ML")
summary(pglsModel)
Apologies for how unclear this is, if anyone has any advice, much appreciated!
Edit: Link to taxonomyforzeldospecies.nex
Output from dput(Zeldodata):
1 Species OGT Domain A C D E F G H I K L M N P Q R S T V W Y
------------------------------- ----- ---------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- -------------- --------------
2 Aeropyrum pernix 95 Archaea 9.7659115711 0.6720465616 4.3895390781 7.6501943794 2.9344881615 8.8666657183 1.5011817208 5.6901432494 4.1428307243 11.0604191603 2.21143353 1.9387130928 5.1038552753 1.6855017182 7.7664358772 6.266067034 4.2052190807 9.2692433532 1.318690698 3.5614200159
3 Argobacterium fabrum 26 Bacteria 11.5698896021 0.7985475923 5.5884500155 5.8165463343 4.0512504104 8.2643271309 2.0116736244 5.7962804605 3.8931525401 9.9250463349 2.5980609708 2.9846761128 4.7828063605 3.1262365491 6.5684282943 5.9454781844 5.3740045968 7.3382308193 1.2519739683 2.3149400984
4 Anaeromyxobacter dehalogenans 27 Bacteria 16.0337898849 0.8860252895 5.1368827707 6.1864992608 2.9730203513 9.3167603253 1.9360386851 2.940143349 2.3473650439 10.898494736 1.6343905351 1.5247123262 6.3580285706 2.4715303021 9.2639057482 4.1890063803 4.3992339725 8.3885969061 1.2890166336 1.8265589289
5 Aquifex aeolicus 85 Bacteria 5.8730327277 0.795341216 4.3287799008 9.6746388172 5.1386954322 6.7148035486 1.5438364179 7.3358775924 9.4641440609 10.5736658776 1.9263080969 3.6183861236 4.0518679067 2.0493569604 4.9229955632 4.7976564501 4.2005259246 7.9169763709 0.9292167138 4.1438942987
6 Archaeoglobus fulgidus 83 Archaea 7.8742687687 1.1695110027 4.9165979364 8.9548767369 4.568636662 7.2640358917 1.4998752909 7.2472039919 6.8957233203 9.4826333048 2.6014466253 3.206476915 3.8419576418 1.7789787933 5.7572748236 5.4763351139 4.1490633048 8.6330814159 1.0325605451 3.6494619148
this will give you a long data frame with each combination and sum per Species (takes about 35 seconds on my machine)...
zeldodata <-
Species %>%
gather(protein, proportion, A:Y) %>%
group_by(Species) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion)
an example of calculating each species separately and saving the data to disk before reading each one in and combining them...
library(readr)
library(dplyr)
library(tidyr)
library(purrr)
# read in CSV file
zeldodata <-
read_delim(
delim = "|",
trim_ws = TRUE,
col_names = TRUE,
col_types = "cicdddddddddddddddddddd",
file = "Species | OGT | Domain | A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y
Aeropyrum pernix | 95 | Archaea | 9.7659115711 | 0.6720465616 | 4.3895390781 | 7.6501943794 | 2.9344881615 | 8.8666657183 | 1.5011817208 | 5.6901432494 | 4.1428307243 | 11.0604191603 | 2.21143353 | 1.9387130928 | 5.1038552753 | 1.6855017182 | 7.7664358772 | 6.266067034 | 4.2052190807 | 9.2692433532 | 1.318690698 | 3.5614200159
Argobacterium fabrum | 26 | Bacteria | 11.5698896021 | 0.7985475923 | 5.5884500155 | 5.8165463343 | 4.0512504104 | 8.2643271309 | 2.0116736244 | 5.7962804605 | 3.8931525401 | 9.9250463349 | 2.5980609708 | 2.9846761128 | 4.7828063605 | 3.1262365491 | 6.5684282943 | 5.9454781844 | 5.3740045968 | 7.3382308193 | 1.2519739683 | 2.3149400984
Anaeromyxobacter dehalogenans | 27 | Bacteria | 16.0337898849 | 0.8860252895 | 5.1368827707 | 6.1864992608 | 2.9730203513 | 9.3167603253 | 1.9360386851 | 2.940143349 | 2.3473650439 | 10.898494736 | 1.6343905351 | 1.5247123262 | 6.3580285706 | 2.4715303021 | 9.2639057482 | 4.1890063803 | 4.3992339725 | 8.3885969061 | 1.2890166336 | 1.8265589289
Aquifex aeolicus | 85 | Bacteria | 5.8730327277 | 0.795341216 | 4.3287799008 | 9.6746388172 | 5.1386954322 | 6.7148035486 | 1.5438364179 | 7.3358775924 | 9.4641440609 | 10.5736658776 | 1.9263080969 | 3.6183861236 | 4.0518679067 | 2.0493569604 | 4.9229955632 | 4.7976564501 | 4.2005259246 | 7.9169763709 | 0.9292167138 | 4.1438942987
Archaeoglobus fulgidus | 83 | Archaea | 7.8742687687 | 1.1695110027 | 4.9165979364 | 8.9548767369 | 4.568636662 | 7.2640358917 | 1.4998752909 | 7.2472039919 | 6.8957233203 | 9.4826333048 | 2.6014466253 | 3.206476915 | 3.8419576418 | 1.7789787933 | 5.7572748236 | 5.4763351139 | 4.1490633048 | 8.6330814159 | 1.0325605451 | 3.6494619148"
)
# save an RDS file for each species
for(species in unique(zeldodata$Species)) {
zeldodata %>%
filter(Species == species) %>%
gather(protein, proportion, A:Y) %>%
mutate(combo = sapply(1:n(), function(i) combn(protein, i, FUN = paste0, collapse = ""))) %>%
mutate(sum = sapply(1:n(), function(i) combn(proportion, i, FUN = sum))) %>%
unnest() %>%
select(-protein, -proportion) %>%
saveRDS(file = paste0(species, ".RDS"))
}
# read in and combine all the RDS files
zeldodata <-
list.files(pattern = "\\.RDS") %>%
map(read_rds) %>%
bind_rows()

Data Cleanup with R: Getting rid of extra fullstops

I am cleaning the data using R.
Below is my data format
Input
1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |
expected output:
1) 100 | 101.25 | 102.25 |NA | NA |201.5
2) 200.05|200.26 | 205 |NA | NA |3000
3) 300.98|300.26 |2001.26 |NA |0.2 |5.65
there are extra full stops at in the table, which I am trying to get cleaned, but to retain decimal numbers in its format
I tried replace all in R, which clears all the full stops, and decimal numbers are distorted.
If the trailing full stop is really the only manifestation of the problem, then you may try just removing it with sub:
x <- c("101.25", "200.56.", "300.26")
x <- sub("\\.$", "", x)
You can use look-ahead to replace dot(.) which are not before space or | as:
x <- '1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |'
y <- gsub("([.]+)(?=[[:blank:]|])","",x,perl = TRUE)
cat(y)
# 1) 100 | 101.25 | 102.25 | | | 201.5 |
# 2) 200.05 | 200.56 | 205 | | | 3000 |
# 3) 300.98 | 300.26 | 2001.56| | 0.2| 5.65 |
Regex explanation:
([.]+) - Group any number of . before look-ahead
(?=[[:blank:]|]) - Look-ahead before :blank: or |
Data:
x <- '1) 100 | 101.25 | 102.25. | . | .. | 201.5. |
2) 200.05. | 200.56. | 205 | .. | . | 3000 |
3) 300.98 | 300.26. | 2001.56.| ... | 0.2| 5.65. |'

Extracting columns from text file

I load a text file (tree.txt) to R, with the below content (copy pasted from JWEKA - J48 command).
I use the following command to load the text file:
data3 <-read.table (file.choose(), header = FALSE,sep = ",")
I would like to insert each column into a separate variables named like the following format COL1, COL2 ... COL8 (in this example since we have 8 columns). If you load it to EXCEL with delimited separation each row will be separated in one column (this is the required result).
Each COLn will contain the relevant characters of the tree in this example.
How can separate and insert the text file into these columns automatically while ignoring the header and footer content of the file?
Here is the text file content:
[[1]]
J48 pruned tree
------------------
MSTV <= 0.4
| MLTV <= 4.1: 3 -2
| MLTV > 4.1
| | ASTV <= 79
| | | b <= 1383:00:00 2 -18
| | | b > 1383
| | | | UC <= 05:00 1 -2
| | | | UC > 05:00 2 -2
| | ASTV > 79:00:00 3 -2
MSTV > 0.4
| DP <= 0
| | ALTV <= 09:00 1 (170.0/2.0)
| | ALTV > 9
| | | FM <= 7
| | | | LBE <= 142:00:00 1 (27.0/1.0)
| | | | LBE > 142
| | | | | AC <= 2
| | | | | | e <= 1058:00:00 1 -5
| | | | | | e > 1058
| | | | | | | DL <= 04:00 2 (9.0/1.0)
| | | | | | | DL > 04:00 1 -2
| | | | | AC > 02:00 1 -3
| | | FM > 07:00 2 -2
| DP > 0
| | DP <= 1
| | | UC <= 03:00 2 (4.0/1.0)
| | | UC > 3
| | | | MLTV <= 0.4: 3 -2
| | | | MLTV > 0.4: 1 -8
| | DP > 01:00 3 -8
Number of Leaves : 16
Size of the tree : 31
An example of the COL1 content will be:
MSTV
|
|
|
|
|
|
|
|
MSTV
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
COL2 content will be:
MLTV
MLTV
|
|
|
|
|
|
>
DP
|
|
|
|
|
|
|
|
|
|
|
|
DP
|
|
|
|
|
|
Try this:
cleaned.txt <- capture.output(cat(paste0(tail(head(readLines("FILE_LOCATION"), -4), -4), collapse = '\n'), sep = '\n'))
cleaned.df <- read.fwf(file = textConnection(cleaned.txt),
header = FALSE,
widths = rep.int(4, max(nchar(cleaned.txt)/4)),
strip.white= TRUE
)
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
For the cleaning process, I end up using a combination of head and tail to remove the 4 spaces on the top and the bottom. There's probably a more efficient way to do this outside of R, but this isn't so bad. Generally, I'm just making the file readable to R.
Your file looks like a fixed-width file so I use read.fwf, and use textConnection() to point the function to the cleaned output.
Finally, I'm not sure how your data is actually structured, but when I copied it from stackoverflow, it pasted with a bunch of whitespace at the end of each line. I'm using some tricks to guess at how long the file is, and removing extraneous columns over here
widths = rep.int(4, max(nchar(cleaned.txt)/4))
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
Next, I'm creating the data in the way you would like it structured.
for (i in colnames(cleaned.df)) {
assign(i, subset(cleaned.df, select=i))
assign(i, capture.output(cat(paste0(unlist(get(i)[get(i)!=""])),sep = ' ', fill = FALSE)))
}
rm(i)
rm(cleaned.df)
rm(cleaned.txt)
What this does is it creates a loop for each column header in your data frame.
From there it uses assign() to put all the data in each column into its' own data frame. In your case, they are named V1 through V15.
Next, it uses a combination of cat() and paste() with unlist() an capture.output() to concatenate your list into a single character vectors, for each of the data frames, so they are now character vectors, instead of data frames.
Keep in mind that because you wanted a space at each new character, I'm using a space as a separator. But because this is a fixed-width file, some columns are completely blank, which I'm removing using
get(i)[get(i)!=""]
(Your question said you wanted COL2 to be: MLTV MLTV | | | | | | > DP | | | | | | | | | | | | DP | | | | | |).
If we just use get(i), there will be a leading whitespace in the output.

Resources