How to make a spaghetti plot in R? - r

I have the following:
head(dataframe):
ID Result Days
1 70 0
1 80 23
2 90 15
2 89 30
2 99 40
3 23 24
etc.
What I am trying to do is create a spaghetti plot with the above dataset. What I use is this:
interaction.plot(dataframe$Days,dataframe$ID,dataframe$Result,xlab="Time",ylab="Results",legend=F)
but none of the patient lines are continuous, even where they should be one long line.
Also I want to convert the above dataframe to something like this:
ID Result Days
1 70 0
1 80 23
2 90 0
2 89 15
2 99 25
3 23 0
etc. (I am trying to take the first (or minimum) day of each ID and have its days start from zero and go up.) Also, in the spaghetti plot I want all patients to have the same color if a condition is met, and another color if the condition is not met.
Thank you for your time and patience.

How about this, using ggplot2 and data.table
# libs
library(ggplot2)
library(data.table)
# your data
df <- data.table(ID=c(1,1,2,2,2,3),
                 Result=c(70,80,90,89,99,23),
                 Days=c(0,23,15,30,40,24))
# adjust each ID to start at day 0, sort
df <- merge(df, df[, list(min_day=min(Days)), by=ID], by='ID')
df[, adj_day:=Days-min_day]
df <- df[order(ID, Days)]
# plot
ggplot(df, aes(x=adj_day, y=Result, color=factor(ID))) +
  geom_line() + geom_point() +
  theme_bw()
Contents of updated data.frame (actually a data.table):
ID Result Days min_day adj_day
1 70 0 0 0
1 80 23 0 23
2 90 15 15 0
2 89 30 15 15
2 99 40 15 25
3 23 24 24 0
You can handle the color coding easily using scale_color_manual()
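As a minimal sketch of that color coding: the question does not say what the condition is, so the rule below (a patient's maximum Result is at least 85) and the column name status are purely hypothetical placeholders. The idea is to compute one label per ID, map color to that label, and keep group = ID so that each patient still gets their own line.

```r
library(ggplot2)
library(data.table)

# same example data as above
df <- data.table(ID = c(1, 1, 2, 2, 2, 3),
                 Result = c(70, 80, 90, 89, 99, 23),
                 Days = c(0, 23, 15, 30, 40, 24))

# shift each ID so its first observation is day 0
df[, adj_day := Days - min(Days), by = ID]

# hypothetical condition: the patient's maximum Result is at least 85
df[, status := ifelse(max(Result) >= 85, "met", "not met"), by = ID]

# group = ID keeps one line per patient; color depends only on the condition
p <- ggplot(df, aes(x = adj_day, y = Result, group = ID, color = status)) +
  geom_line() + geom_point() +
  scale_color_manual(values = c("met" = "firebrick", "not met" = "grey50")) +
  theme_bw()
print(p)
```

Replace the ifelse() test with whatever rule actually applies to your patients; the two colors are arbitrary choices.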

Related

Create heatmap with range of colors in a single cell in R

# libraries used below
library(dplyr)
library(ggplot2)
library(RColorBrewer)
# dataframe
df1 <- df %>%
  mutate(valuesrange=cut(values, breaks=c(0,0.05,10,100,1000,2000,3000, max(values, na.rm=T)),
                         labels=c("0-0.05", "0.05-10", "10-100", "100-1000", "1000-2000", "2000-3000", ">3000"))) %>%
  mutate(valuesrange=factor(as.character(valuesrange), levels=rev(levels(valuesrange))))
#Order for X and Y axis labels
df1$objx <- factor(df1$objx, levels=unique(df1$objx))
df1$objy <- factor(df1$objy, levels=unique(df1$objy))
ggplot(data = df1, aes(x=objx, y=objy, fill = valuesrange)) +
  geom_tile()+
  scale_fill_manual(values=rev(brewer.pal(7, "YlGnBu")), na.value="grey90")
The df1 data looks like this
objy objx values valuesrange
1 1 15 1219 1000-2000
2 1 15 3911 >3000
3 1 15 3224 >3000
4 1 15 14708 >3000
5 1 15 5054 >3000
6 1 15 31499 >3000
7 1 15 1131 1000-2000
8 1 15 4368 >3000
9 1 15 2749 2000-3000
10 1 15 666. 100-1000
11 1 15 1982 1000-2000
I would like to create a heatmap of the df1 data with a single tick value on the x axis and on the y axis, and with the value ranges mentioned above. I need a color for every value range; however, with the code above I see only one single color, as in the image.
Could you please help me generate multiple colors within a single cell?

Select the same name with different [number]

I have column names like the following plot
Can I select all alpha one time instead of typing alpha[1], alpha[2]...alpha[9]?
How can I put in the following codes to let R know I need results of all alpha?
t_alpha <- mcmc_trace(mcmc,pars="alpha")
Something like this perhaps?
library(dplyr)
library(magrittr)
df %>% select(matches("^alpha"))
# alpha.1. alpha.10.
# 1 55 43
# 2 97 20
# 3 80 84
# 4 24 60
# 5 27 21
# 6 98 70
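For the mcmc_trace() call itself, a hedged sketch: it assumes mcmc_trace comes from the bayesplot package and that the draws can be passed as a matrix with one column per parameter. The draws matrix below is made-up illustration data, not the asker's mcmc object. The parameter names are collected with a regular expression and passed to pars; bayesplot also has a regex_pars argument that does the matching for you.

```r
# made-up draws: 10 iterations of four parameters (illustration only)
set.seed(1)
draws <- matrix(rnorm(40), nrow = 10,
                dimnames = list(NULL, c("alpha[1]", "alpha[2]", "alpha[10]", "beta")))

# collect every column whose name starts with "alpha["
alpha_pars <- grep("^alpha\\[", colnames(draws), value = TRUE)
print(alpha_pars)

# only runs if bayesplot is installed
if (requireNamespace("bayesplot", quietly = TRUE)) {
  # explicit vector of names
  t_alpha <- bayesplot::mcmc_trace(draws, pars = alpha_pars)
  # or let bayesplot match the names by regular expression
  t_alpha_regex <- bayesplot::mcmc_trace(draws, regex_pars = "^alpha\\[")
}
```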

Mapping dataframe column values to a n by n matrix

I'm trying to map column values of a data.frame object (consisting of large number of bilateral trade data among 161 countries) to a 161 x 161 adjacency matrix (also of data.frame class) such that each cell represents the dyadic trade flows between any two countries.
The data looks like this
# load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
length(unique(example_data$rid))
[1] 139
length(unique(example_data$pid))
[1] 161
where rid is reporter id, pid is (trade) partner id, a country's rid and pid are the same. The same id(s) in the rid column are matched with multiple rows in the pid column in terms of TradeValue.
However, there are some problems with this data. First, because countries (usually developing countries) that did not report trade statistics have no data to be extracted, their id(s) are absent in the rid column (such as country 1). On the other hand, those country id(s) may enter into pid column through other countries' reporting (in which case, the reporters tend to be developed countries). Hence, the rid column only contains some of the country id (only 139 out of 161), while the pid column has all 161 country id.
What I'm attempting to do is map this example_data dataframe to a 161 x 161 adjacency matrix, using rid for rows and pid for columns, where each cell represents the TradeValue between two country ids. To this end, there are a couple of things I need to tackle:
1. Fill in the country ids that are missing from the rid column of example_data and, temporarily, set all cell values in their respective rows to 0.
2. After the previous step, impute those "0" cells using bilateral trade statistics reported by other countries; if the corresponding statistics are still unavailable, leave those "0" cells as they are.
For example, for a 5-country dataframe of the following form
rid pid TradeValue
2 1 50
2 3 45
2 4 7
2 5 18
3 1 24
3 2 45
3 4 88
3 5 12
5 1 27
5 2 18
5 3 12
5 4 92
The desired output should look like this
pid_1 pid_2 pid_3 pid_4 pid_5
rid_1 0 50 24 0 27
rid_2 50 0 45 7 18
rid_3 24 45 0 88 12
rid_4 0 7 88 0 92
rid_5 27 18 12 92 0
but off the top of my head, I could not figure out how to do it. I would really appreciate it if someone could help me with this.
df1$rid = factor(df1$rid, levels = 1:5, labels = paste("rid",1:5,sep ="_"))
df1$pid = factor(df1$pid, levels = 1:5, labels = paste("pid",1:5,sep ="_"))
data.table::dcast(df1, rid ~ pid, fill = 0, drop = FALSE, value.var = "TradeValue")
# rid pid_1 pid_2 pid_3 pid_4 pid_5
#1 rid_1 0 0 0 0 0
#2 rid_2 50 0 45 7 18
#3 rid_3 24 45 0 88 12
#4 rid_4 0 0 0 0 0
#5 rid_5 27 18 12 92 0
The tricks:
Use factor variables to tell R which values are possible, as well as their order.
In data.table's dcast, use fill = 0 (fill in zero where you have nothing) and drop = FALSE (make entries for factor levels that aren't observed).
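The dcast output above still leaves the rows for non-reporting countries (rid_1 and rid_4) at zero. One possible way to do the imputation step from the question, sketched in base R on the 5-country example, is to fill a zero matrix by (rid, pid) index and then take the element-wise maximum of the matrix and its transpose. This assumes the ids run from 1 to the number of countries and that all trade values are non-negative; where two countries report different values for the same pair, pmax() keeps the larger one, which is a choice made for this sketch rather than something the question specifies.

```r
# 5-country example from the question
df1 <- data.frame(
  rid = c(2, 2, 2, 2, 3, 3, 3, 3, 5, 5, 5, 5),
  pid = c(1, 3, 4, 5, 1, 2, 4, 5, 1, 2, 3, 4),
  TradeValue = c(50, 45, 7, 18, 24, 45, 88, 12, 27, 18, 12, 92)
)

ids <- 1:5  # all country ids, including those that never report

# start with zeros everywhere, then fill in the reported cells
m <- matrix(0, nrow = length(ids), ncol = length(ids),
            dimnames = list(paste0("rid_", ids), paste0("pid_", ids)))
m[cbind(df1$rid, df1$pid)] <- df1$TradeValue

# impute missing reports from the partner's report
m_sym <- pmax(m, t(m))
print(m_sym)
```

Use as.data.frame(m_sym) if the result has to be a data.frame, as in the question.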

Tidying Time Intervals for Plotting Histogram in R

I'm doing some cluster analysis on the MLTobs from the LifeTables package and have come across a tricky problem with the Year variable in the mlt.mx.info dataframe. Year contains the period that the life table was taken, in intervals. Here's a table of the data:
1751-1754 1755-1759 1760-1764 1765-1769 1770-1774 1775-1779 1780-1784 1785-1789 1790-1794
1 1 1 1 1 1 1 1 1
1795-1799 1800-1804 1805-1809 1810-1814 1815-1819 1816-1819 1820-1824 1825-1829 1830-1834
1 1 1 1 1 2 3 3 3
1835-1839 1838-1839 1840-1844 1841-1844 1845-1849 1846-1849 1850-1854 1855-1859 1860-1864
4 1 5 3 8 1 10 11 11
1865-1869 1870-1874 1872-1874 1875-1879 1876-1879 1878-1879 1880-1884 1885-1889 1890-1894
11 11 1 12 2 1 15 15 15
1895-1899 1900-1904 1905-1909 1908-1909 1910-1914 1915-1919 1920-1924 1921-1924 1922-1924
15 15 15 1 16 16 16 2 1
1925-1929 1930-1934 1933-1934 1935-1939 1937-1939 1940-1944 1945-1949 1947-1949 1948-1949
19 19 1 20 1 22 22 3 1
1950-1954 1955-1959 1956-1959 1958-1959 1960-1964 1965-1969 1970-1974 1975-1979 1980-1984
30 30 2 1 40 40 41 41 41
1983-1984 1985-1989 1990-1994 1991-1994 1992-1994 1995-1999 2000-2003 2000-2004 2005-2006
1 42 42 1 1 44 3 41 22
2005-2007
14
As you can see, some of the intervals sit within other intervals. Thankfully none of them overlap. I want to simplify the intervals so intervals such as 1992-1994 and 1991-1994 all go into 1990-1994.
An idea might be to take the modulo of each interval and sort the intervals into their new bins that way, but I'm unsure how to do this with the interval data type. If anyone has any ideas I'd really appreciate the help. Ultimately I want to create a histogram or barplot to illustrate the data nicely.
If I understand your problem, you'll want something like this:
bottom <- seq(1750, 2010, 5)
library(dplyr)
new_df <- mlt.mx.info %>%
arrange(Year) %>%
mutate(year2 = as.numeric(substr(Year, 6, 9))) %>%
mutate(new_year = paste0(bottom[findInterval(year2, bottom)], "-",(bottom[findInterval(year2, bottom) + 1] - 1)))
View(new_df)
What this does is create five-year bins and output a new column (new_year) that labels each row with the bin containing the end year of its interval. So every interval ending between 1750 and 1754 gets the new value 1750-1754 (in string form; the original is an integer type, not sure how to fix that). Does this do what you want? Double check the results, but it looks right to me.
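The same binning logic can be checked on a handful of values without the LifeTables data. This is a small base R sketch; the Year vector below is a hand-picked subset of the intervals listed in the question, used only for illustration.

```r
# lower edges of the five-year bins
bottom <- seq(1750, 2010, 5)

# a few of the intervals from the question (illustration only)
Year <- c("1751-1754", "1816-1819", "1991-1994", "1992-1994", "2000-2003", "2005-2007")

# end year of each interval (characters 6 to 9 of the string)
end_year <- as.numeric(substr(Year, 6, 9))

# index of the bin containing each end year, then the bin label
idx <- findInterval(end_year, bottom)
new_year <- paste0(bottom[idx], "-", bottom[idx + 1] - 1)
print(new_year)

# counts per simplified interval, ready for a barplot
counts <- table(new_year)
barplot(counts, las = 2)
```

Note that table() here counts rows per bin; on the real data, each row of mlt.mx.info is one life table, so the same call gives the number of life tables per five-year period.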

ggplot2 is plotting a line strangely

I am trying to plot the time series x_t = A + (-1)^t B.
To do this I am using the following code. The problem is that the ggplot output is wrong.
require (ggplot2)
set.seed(42)
N<-2
A<-sample(1:20,N)
B<-rnorm(N)
X<-c(A+B,A-B)
dat<-sapply(1:N,function(n) X[rep(c(n,N+n),20)],simplify=FALSE)
dat<-data.frame(t=rep(1:20,N),w=rep(A,each=20),val=do.call(c,dat))
ggplot(data=dat,aes(x=t, y=val, color=factor(w)))+
geom_line()+facet_grid(w~.,scale = "free")
looking at the head of dat everything looks right:
> head(dat)
t w val
1 1 12 10.5533
2 2 12 13.4467
3 3 12 10.5533
4 4 12 13.4467
5 5 12 10.5533
6 6 12 13.4467
So the lower (blue) line should only take the values 10.5533 and 13.4467, but it also takes other values. What is wrong in my code?
Thanks in advance for any help
You really should be more careful before asserting that something is "wrong". The way you are creating dat, there are two rows for each combination of w and t, and the rows are not ordered by dat$t, so head(...) does not display the extra values:
head(dat[order(dat$w,dat$t),],10)
# t w val
# 21 1 18 18.43530
# 61 1 18 18.36313
# 22 2 18 19.56470
# 62 2 18 17.63687
# 23 3 18 18.43530
# 63 3 18 18.36313
# 24 4 18 19.56470
# 64 4 18 17.63687
# 25 5 18 18.43530
# 65 5 18 18.36313
Note the row numbers.
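One way to avoid the duplicated rows, sketched here as an alternative construction rather than a patch to the original sapply call, is to build t and w first (20 rows per series) and then compute val directly from the formula x_t = A + (-1)^t B. The seed and variable names follow the question.

```r
library(ggplot2)

set.seed(42)
N <- 2
A <- sample(1:20, N)
B <- rnorm(N)

# one row per (series, time point): 20 rows for each value of A
dat <- data.frame(t = rep(1:20, times = N),
                  w = rep(A, each = 20))

# x_t = A + (-1)^t * B, using the B that belongs to each series
dat$val <- dat$w + (-1)^dat$t * rep(B, each = 20)

p <- ggplot(dat, aes(x = t, y = val, color = factor(w))) +
  geom_line() +
  facet_grid(w ~ ., scales = "free")
print(p)
```

With this layout every series alternates between exactly two values, so each facet shows a clean zigzag.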
