R: Automatically Creating Graphs from Dataframes

I am working with the R programming language.
I have the following data set (route_1):
route_1
id long lat
1 1 -74.56048 40.07051
3 3 -72.44129 41.71506
4 4 -77.53908 41.55434
2 2 -74.23018 40.12929
6 6 -78.68685 42.35981
5 5 -79.26506 43.22408
Based on this data, I want to make a directed network graph in which each row is linked only to the row that comes directly after it. Using the "igraph" library, I was able to do this manually:
library(igraph)
my_data <- data.frame(
  node_a = c("1", "3", "4", "2", "6"),
  node_b = c("3", "4", "2", "6", "5")
)
graph <- graph.data.frame(my_data, directed=TRUE)
graph <- simplify(graph)
plot(graph)
My Question: Is it possible to make the above network graph directly from the "route_1" dataset, without manually creating a new data set that records which node is connected to which?
Thanks!

Is the dataset always going to be ordered correctly, so the plot goes from row 1 -> 2 -> 3 and so on in a single line? If so, we can build the edge data frame by simply offsetting the id column against itself. If we put the steps in a function, the call becomes a one-liner:
plot_nodes <- function(x) {
  id <- x$id
  a <- id[-length(id)]  # every node except the last
  b <- id[-1]           # every node except the first
  graph.data.frame(data.frame(a, b), directed = TRUE)
}
graph <- plot_nodes(route_1)
plot(simplify(graph))
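If you prefer not to define a helper, the same consecutive-pair idea works inline with graph_from_data_frame, the current name for graph.data.frame (a sketch, assuming route_1 is already in the desired visiting order):
library(igraph)
id <- as.character(route_1$id)
# pair each id with its successor: 1-3, 3-4, 4-2, 2-6, 6-5
edges <- data.frame(from = head(id, -1), to = tail(id, -1))
graph <- graph_from_data_frame(edges, directed = TRUE)
plot(simplify(graph))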

How to match characters from strings in R?

Before posting my question, I would just like to mention that I have looked through the "Similar questions" tab and have not quite found what I am looking for. I found something somewhat similar here, but it is in Python. There was also a nice idea here that may help as a last resort. In any case, I would first like to try whether there is a more straightforward way to do it.
To the problem:
Say there are two different data frames: (1) Ref_seq and (2) Variants:
>Ref_seq
Seq_name AA_seq
1 Ref1 VSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQ
2 Ref2 SNFPHLVLEKILVSLTMKNCKAAMNFFQ
3 Ref3 RRQKRPSSGTIFNDAFWLDLNYLEVAKVAQS
4 Ref4 HCTSVSKEVEGTSYHESLYNALQSLRDR
5 Ref5 DHTGEYGNLVTIQSFKAEFRLAGGVNLPKIIDC
6 Ref6 HKDQMVDIMRASQDNPQDGIMVKLVVNLLQLS
7 Ref7 SNILLKDILSVRKYWCEISQQQWLELFSVY
8 Ref8 LTIFLKTLAVNFRIRVCELGDEILPTLLYIWT
9 Ref9 EDQSSMNLFNDYPDSSVSDANEPGESQSTIG
10 Ref10 SLSEKSKEETGISLQDLLLEIYRSIGEPDSL
>Variants
peptideID AA_seq
1 Pep1 QEISALVKYF
2 Pep2 HTGERGNLVT
3 Pep3 NKMTTSVLIK
4 Pep4 SMNLKNDYPD
5 Pep5 NEPGYSQSTI
6 Pep6 NPQDVIMVKL
7 Pep7 MAAKFNKMTL
8 Pep8 RRQKDPSSGT
9 Pep9 QQQWTELFSV
The first data frame contains the amino acid (aa) sequences from a reference organism, whilst the second contains the aa sequences from a test organism. It is known that each sequence in Variants (a) contains at least one aa change, (b) shares at least 4 matching characters with its reference sequence in Ref_seq, and (c) may match forwards or backwards (e.g. the aa sequence in line 3 of Variants).
I am trying to find a way to look up and retrieve which reference sequence (Seq_name) each peptideID belongs to. The result should look like this:
peptideID AA_seq Seq_name
1 Pep1 QEISALVKYF Ref1
2 Pep2 HTGERGNLVT Ref5
3 Pep3 NKMTTSVLIK Ref2
4 Pep4 SMNLKNDYPD Ref9
5 Pep5 NEPGYSQSTI Ref9
6 Pep6 NPQDVIMVKL Ref6
7 Pep7 MAAKFNKMTL Ref2
8 Pep8 RRQKDPSSGT Ref3
9 Pep9 QQQWTELFSV Ref7
I thought that maybe a regex coupled with a loop over each peptideID could work, since the pattern changes with each one, but I cannot wrap my head around it.
Any help will be very welcome!
Data from the example:
Ref_seq <- data.frame(Seq_name=paste0("Ref",1:10), AA_seq=c("VSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQ", "SNFPHLVLEKILVSLTMKNCKAAMNFFQ", "RRQKRPSSGTIFNDAFWLDLNYLEVAKVAQS", "HCTSVSKEVEGTSYHESLYNALQSLRDR", "DHTGEYGNLVTIQSFKAEFRLAGGVNLPKIIDC", "HKDQMVDIMRASQDNPQDGIMVKLVVNLLQLS", "SNILLKDILSVRKYWCEISQQQWLELFSVY", "LTIFLKTLAVNFRIRVCELGDEILPTLLYIWT", "EDQSSMNLFNDYPDSSVSDANEPGESQSTIG", "SLSEKSKEETGISLQDLLLEIYRSIGEPDSL"))
Variants <- data.frame(peptideID=paste0("Pep",1:9), AA_seq=c("QEISALVKYF", "HTGERGNLVT", "NKMTTSVLIK", "SMNLKNDYPD", "NEPGYSQSTI", "NPQDVIMVKL", "MAAKFNKMTL", "RRQKDPSSGT", "QQQWTELFSV"))
You can try something like this. It takes the AA sequences from var and compares them with the ones from Ref_seq, including reverse matching. It uses agrep for fuzzy matching.
data.frame(var, Seq_name = sapply(var$AA_seq, function(x) {
  # try a forward fuzzy match against every reference sequence
  fwd <- agrep(x, Ref_seq$AA_seq)[1]
  if (!is.na(fwd)) return(Ref_seq$Seq_name[fwd])
  # no forward hit: match the reversed peptide instead
  rev_x <- paste0(rev(strsplit(x, "")[[1]]), collapse = "")
  Ref_seq$Seq_name[agrep(rev_x, Ref_seq$AA_seq)[1]]
}))
peptideID AA_seq Seq_name
1 Pep1 QEISALVKYF Ref1
2 Pep2 HTGERGNLVT Ref5
3 Pep3 NKMTTSVLIK Ref2
4 Pep4 SMNLKNDYPD Ref9
5 Pep5 NEPGYSQSTI Ref9
6 Pep6 NPQDVIMVKL Ref6
7 Pep7 MAAKFNKMTL Ref2
8 Pep8 RRQKDPSSGT Ref3
9 Pep9 QQQWTELFSV Ref7
Although this works for this example, I would suggest searching for a Bioconductor library that does what you want. There are many tricky situations that those libraries already solve.
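For instance, a sketch of that route with the Biostrings package (an assumption on my part; it is installed with BiocManager::install("Biostrings")): vmatchPattern searches one peptide against all references with a mismatch allowance, which should reproduce the mapping above.
library(Biostrings)
refs <- AAStringSet(setNames(Ref_seq$AA_seq, Ref_seq$Seq_name))
find_ref <- function(pep) {
  # count hits of the peptide in each reference, allowing one mismatch
  hits <- elementNROWS(vmatchPattern(pep, refs, max.mismatch = 1))
  if (all(hits == 0)) {
    # no forward hit: search with the reversed peptide instead
    hits <- elementNROWS(vmatchPattern(reverse(AAString(pep)), refs,
                                       max.mismatch = 1))
  }
  names(refs)[which(hits > 0)[1]]
}
sapply(Variants$AA_seq, find_ref)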
Data
Ref_seq <- structure(list(Seq_name = c("Ref1", "Ref2", "Ref3", "Ref4", "Ref5",
"Ref6", "Ref7", "Ref8", "Ref9", "Ref10"), AA_seq = c("VSASTQASRQKKMQEISSLVKYFIKCANRRAPRLKCQ",
"SNFPHLVLEKILVSLTMKNCKAAMNFFQ", "RRQKRPSSGTIFNDAFWLDLNYLEVAKVAQS",
"HCTSVSKEVEGTSYHESLYNALQSLRDR", "DHTGEYGNLVTIQSFKAEFRLAGGVNLPKIIDC",
"HKDQMVDIMRASQDNPQDGIMVKLVVNLLQLS", "SNILLKDILSVRKYWCEISQQQWLELFSVY",
"LTIFLKTLAVNFRIRVCELGDEILPTLLYIWT", "EDQSSMNLFNDYPDSSVSDANEPGESQSTIG",
"SLSEKSKEETGISLQDLLLEIYRSIGEPDSL")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
var <- structure(list(peptideID = c("Pep1", "Pep2", "Pep3", "Pep4",
"Pep5", "Pep6", "Pep7", "Pep8", "Pep9"), AA_seq = c("QEISALVKYF",
"HTGERGNLVT", "NKMTTSVLIK", "SMNLKNDYPD", "NEPGYSQSTI", "NPQDVIMVKL",
"MAAKFNKMTL", "RRQKDPSSGT", "QQQWTELFSV")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

Classification by minimum distance

I want to classify my data by the minimum distance to a set of known centers.
How can I implement this in R?
The centers:
> centers
X
1 -0.78998176
2 2.40331380
3 0.77320007
4 -1.64054294
5 -0.05343331
6 -1.14982180
7 1.67658736
8 -0.44575567
9 0.36314671
10 1.18697840
The data to be classified:
> Y
[1] -0.7071068 0.7071068 -0.3011463 -0.9128686 -0.5713978 NA
What I want:
1. Find the closest center (smallest absolute difference) for each item in Y.
2. Assign the row number of that center to each item in Y.
Expected result:
> Y
[1] 1 3 8 1 8 NA
Y <- c(-0.707106781186548, 0.707106781186548, -0.301146296962689,
-0.912868615826101, -0.571397763410073, NA)
centers <- structure(c(-0.789981758587318, 2.40331380121291, 0.773200070034431,
-1.64054294268215, -0.0534333085941505, -1.14982180092619, 1.67658736336158,
-0.445755672120908, 0.363146708827924, 1.18697840480949), .Dim = c(10L,
1L), .Dimnames = list(c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10"), "X"))
sapply(Y, function(y) {
  r <- which.min(abs(y - centers))
  ifelse(is.na(y), NA, r)
})
Essentially, you are applying which.min to each element of Y and determining which center has the smallest absolute distance. Ties go to the earlier center in the list. NA values need to be handled separately, which is why the ifelse is there.
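An equivalent fully vectorized form (a sketch using base R only; outer builds the whole |Y - center| distance matrix at once):
d <- abs(outer(Y, as.vector(centers), "-"))  # length(Y) x 10 matrix of distances
apply(d, 1, function(r) if (all(is.na(r))) NA else which.min(r))
# [1] 1 3 8 1 8 NA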
This is not clustering, but nearest-neighbour classification.
See the knn function in the class package.
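A sketch of that route (knn comes with the recommended class package; with k = 1, every point in Y is labelled by its single nearest center):
library(class)
ok <- !is.na(Y)
cl <- rep(NA_integer_, length(Y))
pred <- knn(train = centers, test = matrix(Y[ok]), cl = rownames(centers), k = 1)
cl[ok] <- as.integer(as.character(pred))  # factor labels back to center indices
cl
# [1] 1 3 8 1 8 NA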

Converting csv values to table in R

I have some data from a poll which looks like this:
Freetime_activities
1 Travelling, On the PC, Clubbing
2 Sports, On the PC, Clubbing
3 Clubbing
4 On the PC
5 Travelling, On the PC, Clubbing
6 On the PC
7 Watching TV, Travelling
I want to get the count of each value (how many times Travelling, On the PC, etc. appear), but I'm having trouble splitting the values. Is there a function in R that can do, for example:
split("A,B,C") ->
1 A
2 B
3 C
Or is there a straightforward way to count the values directly from the column?
We can use strsplit to split the column on the delimiter ", ", unlist the list output, and then use table to get the frequencies:
tbl <- table(unlist(strsplit(as.character(df1$Freetime_activities), ", ")))
as.data.frame(tbl)
# Var1 Freq
#1 Clubbing 4
#2 On the PC 5
#3 Sports 1
#4 Travelling 3
#5 Watching TV 1
NOTE: as.character is used here in case the column is a factor, since strsplit accepts only character vectors.
Or another option would be to use scan to extract the elements and then get the frequencies with table:
table(trimws(scan(text = as.character(df1$Freetime_activities),
what = "", sep = ",")))
Or using read.table with unlist and table:
table(unlist(read.table(text = as.character(df1$Freetime_activities),
sep = ",", fill = TRUE, strip.white = TRUE)))
EDIT: Based on #David Arenburg's comments.
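A tidyverse take on the same count (a sketch, assuming the tidyr and dplyr packages): separate_rows splits the column into one activity per row, and count tabulates them.
library(dplyr)
library(tidyr)
df1 %>%
  separate_rows(Freetime_activities, sep = ",\\s*") %>%
  count(Freetime_activities)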
data
df1 <- structure(list(Freetime_activities = c("Travelling, On the PC, Clubbing",
  "Sports, On the PC, Clubbing", "Clubbing", "On the PC",
  "Travelling, On the PC, Clubbing", "On the PC", "Watching TV, Travelling")),
  .Names = "Freetime_activities", class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6", "7"))

using sweep (or something similar) in R for listed data frames

I'm doing a handful of transformation steps on several data frames, so I have ventured into the beautiful world of apply, lapply, sweep, etc. Unfortunately I got stuck trying to use sweep on a list of data frames.
What I would like to do is calculate the percentage of each value, based on the mean of each data frame's first row.
So I put my dfs into a list, which ends up looking something like this:
df1 <- read.table(header = TRUE, text = "a b
1 16.26418 19.60232
2 16.09745 18.44320
3 17.25242 18.21141
4 17.61503 17.64766
5 18.35453 19.52620")
df2 <- read.table(header = TRUE, text = "a b
1 4.518654 4.346056
2 4.231176 4.175854
3 2.658694 4.999478
4 3.348019 2.345594
5 3.103378 2.556690")
list.one <- list(df1,df2)
> list.one
[[1]]
a b
1 16.26418 19.60232
2 16.09745 18.44320
3 17.25242 18.21141
4 17.61503 17.64766
5 18.35453 19.52620
[[2]]
a b
1 4.518654 4.346056
2 4.231176 4.175854
3 2.658694 4.999478
4 3.348019 2.345594
5 3.103378 2.556690
Now I calculate the mean of each first row and store it:
one.hundred <- lapply(list.one, function(i) rowMeans(i[1, ], na.rm = TRUE))
> one.hundred
[[1]]
1
17.93325
[[2]]
1
4.432355
Now I calculate the percentages (relative to the stored first-row means), and the best I came up with is this rather tedious workaround:
df1.per <- sweep(list.one[[1]], 1, one.hundred[[1]], function(x, y) 100 / y * x)
df2.per <- sweep(list.one[[2]], 1, one.hundred[[2]], function(x, y) 100 / y * x)
list.new <- list(df1.per, df2.per)
If somebody could suggest me simpler, preferably list based solution that would be great help.
Thanks a lot.
Here's another approach with sapply and Map that will also return a list of data.frames:
means <- sapply(list.one, function(df) rowMeans(df[1, ], na.rm = TRUE))
Map(function(vec, df) df/vec*100, means, list.one)
#$`1`
# a b
#1 90.69287 109.30713
#2 89.76315 102.84360
#3 96.20353 101.55109
#4 98.22553 98.40748
#5 102.34916 108.88266
#
#$`1`
# a b
#1 101.94702 98.05298
#2 95.46113 94.21299
#3 59.98378 112.79507
#4 75.53589 52.91981
#5 70.01646 57.68243
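If the intermediate means are not needed, a single lapply does the whole thing in one pass (a sketch; each data frame is scaled by the mean of its own first row):
lapply(list.one, function(df) df / rowMeans(df[1, ], na.rm = TRUE) * 100)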
data:
> dput(list.one)
list(structure(list(a = c(16.26418, 16.09745, 17.25242, 17.61503,
18.35453), b = c(19.60232, 18.4432, 18.21141, 17.64766, 19.5262
)), .Names = c("a", "b"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")), structure(list(a = c(4.518654, 4.231176,
2.658694, 3.348019, 3.103378), b = c(4.346056, 4.175854, 4.999478,
2.345594, 2.55669)), .Names = c("a", "b"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")))

R - Extremely slow code

I'm new to R and I'm stuck with a problem I can't solve by myself.
A friend recommended one of the apply functions; I just don't get how to use it in this case. Anyway, on to the problem! =)
Inside the inner while loop, I have an ifelse. That is the bottleneck: it takes on average 1 second to run each iteration. The slow part is marked with #slow part start/end in the code.
Given that we will run it 2000 * 100 = 200,000 times, it will take approximately 55.5 hours to finish each time we run this code. The bigger problem is that this will be reused a lot, so x * 55.5 hours is just not doable.
Below is the fraction of the code relevant to the question:
#dt is a data.table with close to 1.5 million observations of 11 variables
#rand.mat is a 110*100 integer matrix
j <- 1
while (j <= 2000) {
  #other code is executed here, not relevant to the question
  i <- 1
  while (i <= 100) {
    #slow part start
    dt$column2 <- ifelse(dt$datecolumn %in% c(rand.mat[, i]) & dt$column4 == index[i],
                         NA, dt$column2)
    #slow part end
    i <- i + 1
  }
  #other code is executed here, not relevant to the question
  j <- j + 1
}
Please, any advice would be greatly appreciated.
EDIT: Run the code below to reproduce the problem.
library(data.table)
dt = data.table(
  datecolumn = c("20121101", "20121101", "20121104", "20121104", "20121130", "20121130",
                 "20121101", "20121101", "20121104", "20121104", "20121130", "20121130"),
  column2 = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
  column3 = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
  column4 = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2")
)
unq_date <- c(20121101L,
20121102L, 20121103L, 20121104L, 20121105L, 20121106L, 20121107L,
20121108L, 20121109L, 20121110L, 20121111L, 20121112L, 20121113L,
20121114L, 20121115L, 20121116L, 20121117L, 20121118L, 20121119L,
20121120L, 20121121L, 20121122L, 20121123L, 20121124L, 20121125L,
20121126L, 20121127L, 20121128L, 20121129L, 20121130L
)
index <- as.numeric(dt$column4)
numberOfRepititions <- 2
set.seed(131107)
rand.mat <- replicate(numberOfRepititions, sample(unq_date, numberOfRepititions))
i <- 1
while (i <= numberOfRepititions) {
  dt$column2 <- ifelse(dt$datecolumn %in% c(rand.mat[, i]) & dt$column4 == index[i],
                       NA, dt$column2)
  i <- i + 1
}
Notice that we won't be able to run the loop more than 2 times now, unless dt grows so that column4 has the initial 100 levels (it is just an integer value 1-100).
Here is one proposal based on your small example dataset; I tried to vectorize the operations. As in your example, numberOfRepititions represents the number of loop runs.
First, create matrices for all the necessary evaluations. dt$datecolumn is compared with all columns of rand.mat:
rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
Here, dt$column4 is compared with all values of the vector index:
imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
Both matrices are combined with logical AND. Afterwards, we check each row for at least one TRUE:
replace_idx <- rowSums(rmat & imat) != 0
Use the created index to replace corresponding values with NA:
is.na(dt$column2) <- replace_idx
Done.
The code in one chunk:
rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
replace_idx <- rowSums(rmat & imat) != 0
is.na(dt$column2) <- replace_idx
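Since dt is already a data.table, an update join is another vectorized option. A sketch against the repro objects above: build the set of (date, group) pairs to blank, then mark the matching rows by reference.
library(data.table)
bad <- data.table(
  datecolumn = as.character(as.vector(rand.mat[, seq_len(numberOfRepititions)])),
  column4    = rep(as.character(head(index, numberOfRepititions)),
                   each = nrow(rand.mat))
)
dt[bad, on = .(datecolumn, column4), column2 := NA_character_]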
I think you can do it in one line like this:
dt[which(apply(dt, 1, function(x) x[1] %in% rand.mat[, as.numeric(x[4])])), ]$column2 <- NA
Basically, the apply call works as follows, argument by argument:
1) it uses the data from dt;
2) the 1 means apply by row;
3) the function receives each row as a character vector x (x[1] is datecolumn, x[4] is column4) and returns TRUE if your criteria are met.
