Visualization of data through stripchart with different pch values (in columns) - r

Maybe this is simple, but I can't work out how to plot the following data in R.
Basically, the x-axis shows the type (four labels: A, B, C, D), and I want to plot the numbers (v1, v2, v3, v4) as points. For example, for A I want points at 99, 110, 150, and 170 on the y-axis. Moreover, I need a different pch value for each point. I tried stripchart, but I am not sure how to assign different pch values within each of the types A, B, C, D. Please see the points (99, 110, 150, 170) for A in the figure below.
Data:
type A B C D
v1 99 227 295 503
v2 110 440 620 970
v3 150 600 934 1330
v4 170 650 1012 1390

Are you looking for something like this?
df = read.table(header=TRUE, text="type A B C D
v1 99 227 295 503
v2 110 440 620 970
v3 150 600 934 1330
v4 170 650 1012 1390")
stripchart(df[-1], pch=c(1, 2, 3, 4), vertical=TRUE)
Resulting in:
Update
Sorry, I misread the first question. Try this:
df2 = data.frame(t(df[-1]))
names(df2) = df$type
df2$group = rownames(df2)
library(lattice)
stripplot(v1 + v2 + v3 + v4 ~ group, data=df2, pch=c(1, 2, 3, 4))
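A base-graphics alternative, as a minimal sketch: matplot() draws one series per matrix column, so transposing the data gives each of v1..v4 its own pch within every type. The object name m, the axis labels, and the legend position are illustrative choices, not part of the original answer.

```r
# Data from the question: one row per series (v1..v4), one column per type (A..D)
df <- read.table(header = TRUE, text = "type A B C D
v1 99 227 295 503
v2 110 440 620 970
v3 150 600 934 1330
v4 170 650 1012 1390")

# Transpose so rows are types and columns are series; matplot() plots each
# column as its own series, so every v gets a distinct plotting symbol.
m <- t(as.matrix(df[-1]))
colnames(m) <- as.character(df$type)

matplot(m, type = "p", pch = 1:4, col = "black", xaxt = "n",
        xlab = "type", ylab = "value")
axis(1, at = seq_len(nrow(m)), labels = rownames(m))  # A, B, C, D on the x-axis
legend("topleft", legend = colnames(m), pch = 1:4)
```

Each x position then carries four symbols, one per row of the original table.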


R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to work out a percentage of the total over a look-back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This sums column B over a range from today looking back 3 lines. It then divides this sum by the total sum of columns B and C, again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look-back range to:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { x[1] / (x[1] + x[2]) } )
If I could get [1] to sum the previous 3-line range, and
if I could get [2] to also sum the previous 3-line range.
I am still learning how to apply forward and look-back periods within R, so any additional explanation in the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rolling sums, convert that to row proportions and take the "adv" column. Finally, assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
  rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834
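As one more base-R variation on the approaches above, here is a minimal sketch using stats::filter() with sides = 1 to get trailing sums. The helper name trail_sum and the variable k are hypothetical names introduced for illustration; the data are the same as in the Note above, entered without row names.

```r
DF <- read.table(header = TRUE, text = "adv dec
69 376
113 293
270 150
74 371
308 96
236 173
252 134
287 129
219 187")

k <- 3  # look-back window, including the current row

# Trailing (right-aligned) rolling sum: sides = 1 uses only the current and
# earlier values, so the first k - 1 entries are NA.
trail_sum <- function(x, k) as.numeric(stats::filter(x, rep(1, k), sides = 1))

adv_k <- trail_sum(DF$adv, k)
dec_k <- trail_sum(DF$dec, k)
DF$frac <- adv_k / (adv_k + dec_k)
DF
```

Changing k adjusts the look-back length without touching the rest of the code.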

R One sample test for set of columns for each row

I have a data set where I have the Levels and Trends for say 50 cities for 3 scenarios. Below is the sample data -
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios I find the minimum Level and Minimum Trend using the below code -
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check if these minimum values are significantly different between the levels and trends respectively. So check L_min with columns 2-4 and T_min with columns 5-7. This needs to be done for each city (row) and if significant then return which column it is significantly different with.
It would help if someone could explain how this can be done.
Thank you!!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715 176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980 264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891 333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058 405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423 100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841 374 0.57909775
> df$FC <- rowMeans(df[,2:4])/df[,8]
> df <- df[order(-df$FC), ]
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121 117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542 117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544 101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484 132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217 110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041 182 0.25194969 2.117216
Now the rows that differ most across columns 2:4 are at the top. Columns 5:7 can be handled in an analogous way.
And some tips on statistical tests:
Prefer t.test (parametric, based on means) over the Wilcoxon test (Mann-Whitney U; non-parametric, based on ranks) when its assumptions hold, because it has more power. HOWEVER:
- The data sets should be large. Example hypothesis: Montreal has taller citizens than Quebec. t.test will work fine when you take 100 people from each city, giving height measurements of 200 people, 100 vs 100.
- The distribution should be close to normal in all samples, or both samples should have a similar non-normal distribution (it may be binomial, for example). In any case, we can't use this test when one sample is normally distributed and the other isn't.
- The sizes of both samples should be equal or similar, so 100 vs 100 is OK, but 87 vs 234 is not ideal: the p-value may still come out below 0.05, but it can be misleading.
If your data don't meet the above conditions, I prefer a non-parametric test: less power, but more robust.
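The tips above can be sketched roughly as follows. The height data are simulated (the means and standard deviation are made up), looks_normal is a hypothetical helper, and using Shapiro-Wilk with a 0.05 cut-off as the normality check is only one possible convention, not something the answer prescribes.

```r
set.seed(1)  # simulated heights in cm; values are invented for illustration
montreal <- rnorm(100, mean = 172, sd = 7)
quebec   <- rnorm(100, mean = 170, sd = 7)

# Rough normality check for one sample (Shapiro-Wilk; 0.05 is an assumption)
looks_normal <- function(x) shapiro.test(x)$p.value > 0.05

# Use the t-test when both samples look normal, otherwise fall back to the
# rank-based Wilcoxon test.
res <- if (looks_normal(montreal) && looks_normal(quebec)) {
  t.test(montreal, quebec)       # Welch two-sample t-test by default
} else {
  wilcox.test(montreal, quebec)  # non-parametric alternative
}
res$p.value
```

Both functions return an object of class "htest", so the p-value is read the same way in either branch.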

2D irregular aggregation of a matrix

I'm trying to bin a symmetric matrix with irregular intervals in R but am not sure how to proceed. My ideas are:
Reshape the matrix to long format, aggregate and cast it back?
Bin as-is in both dimensions (somehow... tapply, aggregate?)
Keep the regular binning but for each of my (larger) irregular bins, replace all inner values with their sum?
Here's an example of what I'm trying to do:
set.seed(42)
# symmetric matrix
a <- matrix(rpois(1e4, 2), 100)
a[upper.tri(a)] <- t(a)[upper.tri(a)]
image(x=1:100, y=1:100, a, asp=1, frame=F, axes=F)
# vector of irregular breaks for binning
breaks <- c(12, 14, 25, 60, 71, 89)
# white lines show the desired bins
abline(h=breaks-.5, lwd=2, col="white")
abline(v=breaks-.5, lwd=2, col="white")
(The aim being that each rectangle drawn above be filled according to the sum of values within it.) I'd appreciate any pointers of how best to approach this.
This answer provides a great starting point using tapply:
library(reshape2)  # provides melt()
b <- melt(a)
bb <- with(b, tapply(value,
                     list(
                       y=cut(Var1, breaks=c(0, breaks, Inf), include.lowest=T),
                       x=cut(Var2, breaks=c(0, breaks, Inf), include.lowest=T)
                     ),
                     sum)
)
bb
# x
# y [0,12] (12,14] (14,25] (25,60] (60,71] (71,89] (89,Inf]
# [0,12] 297 48 260 825 242 416 246
# (12,14] 48 3 43 141 46 59 33
# (14,25] 260 43 261 794 250 369 240
# (25,60] 825 141 794 2545 730 1303 778
# (60,71] 242 46 250 730 193 394 225
# (71,89] 416 59 369 1303 394 597 369
# (89,Inf] 246 33 240 778 225 369 230
These can then be plotted as rectangular bins using a base plot and rect — i.e.:
library("reshape2")
library("magrittr")
bsq <- melt(bb)
# convert range notation to numerics
getNum <- . %>%
  # rm brackets
  gsub("\\[|\\(|\\]|\\)", "", .) %>%
  # split digits and convert
  strsplit(",") %>%
  unlist %>% as.numeric
y <- t(sapply(bsq[,1], getNum))
x <- t(sapply(bsq[,2], getNum))
# normalise bin intensity by area
bsq$size <- (y[,2] - y[,1]) * (x[,2] - x[,1])
bsq$norm <- bsq$value / bsq$size
# draw rectangles on top of empty plot
plot(1:100, 1:100, type="n", frame=F, axes=F)
rect(ybottom=y[,1], ytop=y[,2],
     xleft=x[,1], xright=x[,2],
     col=rgb(colorRamp(c("white", "steelblue4"))(bsq$norm / max(bsq$norm)),
             alpha=255*(bsq$norm / max(bsq$norm)), max=255),
     border="white")
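The binning step can also be done without reshaping to long format. The following is a minimal sketch that applies rowsum() once along each dimension; the names g, row_binned and bb2 are hypothetical, and the bins are assumed to be 1-12, 13-14, 15-25, 26-60, 61-71, 72-89 and 90-100, matching the cut() intervals used above.

```r
set.seed(42)
# symmetric matrix, as in the question
a <- matrix(rpois(1e4, 2), 100)
a[upper.tri(a)] <- t(a)[upper.tri(a)]
breaks <- c(12, 14, 25, 60, 71, 89)

# Bin index (1..7) for every row/column position; an index moves to the
# next bin once it exceeds a break.
g <- findInterval(seq_len(nrow(a)), breaks + 1) + 1

# Sum rows within each bin, then do the same for the columns of the result.
row_binned <- rowsum(a, g)            # 7 x 100
bb2 <- t(rowsum(t(row_binned), g))    # 7 x 7: bb2[i, j] sums row bin i, column bin j
bb2
```

Because only two grouped sums are needed, this avoids building the 10,000-row long table.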

barplot in R 3.1.1

Even though I am getting used to R, I am still new to it, and I hope that someone can help me with this task. I looked through previous topics but couldn't find what I was looking for.
I am trying to draw a bar plot but am not having much luck with some of the settings. I am using R 3.1.1 on Mac OS 10.9.4.
My table looks like this:
family area1 area2 area3 area4 area5 area6
A 15 20 500 200 17 26
B 170 520 26 13 100 70
C 35 250 358 128 88 26
D 95 375 289 156 169 356
E 425 177 136 144 285 70
Since I have the file saved as a CSV, I am following these steps:
fam <- read.csv ("family_per_area_count.csv", sep =";", header = T)
I convert the data to a matrix:
fam.mat <- as.matrix(fam_1, ncol = 6, byrow = T)
Then I assign row names and column names:
rownames(fam.mat) <- c("A", "B", "C", "D", "E")
colnames(fam.mat) <- c("area1", "area2", "area3", "area4", "area5", "area6")
Then I simply run the barplot command:
barplot(fam.mat, beside = T, col = rainbow(ncol(fam.mat)))
But most of the x-axis labels are missing and the plot seems to be squeezed together.
I also tried to draw the stacked (cumulative) bar plot using this command:
par(mar = c(5.1, 4.1, 4.1, 7.1), xpd = TRUE)
prop <- prop.table(data_mat, margin = 2)
barplot(data_mat, col = rainbow(length(rownames(data_mat))), width = 3)
legend("topright", inset = c(-0.25, 0), fill = rainbow(length(rownames(data_mat))),
legend = rownames(data_mat))
But the legend colours don't match the data, and again my x-axis seems off-centre. I have tried transposing the matrix, but still no luck.
Can anyone make any suggestions?
Thank you so much in advance
F.
Here is a start:
DF <- read.table(text="family area1 area2 area3 area4 area5 area6
A 15 20 500 200 17 26
B 170 520 26 13 100 70
C 35 250 358 128 88 26
D 95 375 289 156 169 356
E 425 177 136 144 285 70", header=TRUE)
library(reshape2)
DF <- melt(DF, id.vars="family")
library(ggplot2)
ggplot(DF, aes(x=family, y=value, fill=variable)) +
  geom_bar(stat="identity", position="dodge")
Study ggplot2 documentation and tutorials to learn how to customise the plot.
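For comparison, a minimal base-graphics sketch of the grouped bar plot. It keeps only the numeric columns in the matrix and uses one colour per family (row), so the legend and the bars share the same colour vector. The names wide, m and cols are hypothetical, and las = 2 (perpendicular axis labels) is just one way to keep the area labels from being dropped.

```r
wide <- read.table(header = TRUE, text = "family area1 area2 area3 area4 area5 area6
A 15 20 500 200 17 26
B 170 520 26 13 100 70
C 35 250 358 128 88 26
D 95 375 289 156 169 356
E 425 177 136 144 285 70")

# Numeric matrix only: families become row names instead of a data column.
m <- as.matrix(wide[-1])
rownames(m) <- as.character(wide$family)

# With beside = TRUE each group (area) contains one bar per row (family),
# so the colour vector needs nrow(m) entries.
cols <- rainbow(nrow(m))
barplot(m, beside = TRUE, col = cols, las = 2,
        legend.text = rownames(m),
        args.legend = list(x = "topright", title = "family"))
```

Dropping beside = TRUE gives the stacked version with the same colour-to-family mapping.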

Number of unique values within a range in data-frame

From a data-frame, I want to extract the number of unique values (of X) within a certain range of Y (e.g. for every 0-100, 101-200, 201-300, etc. up to 3000).
Example df
X Y
169 183
546 64
154 148
593 203
60 243
568 370
85 894
168 169
154 148
83 897
…
A time-consuming way would be to run the following code for each range:
junk<-subset(df, Y > 0 & Y < 100)
length(unique(junk$record.no))
But I have to ask the experts - there must be a better way?
You can do it with by() and cut():
data <- data.frame(X=ceiling(rnorm(10000, 500, 10)), Y=runif(10000, 0, 3000))
data$Groups <- cut(data$Y, seq(0, 3000, 100)) # Create a categorical variable for each range
by(data$X, data$Groups, function(x) length(unique(x)))
This seems valid:
aggregate(DF$X, list(cut(DF$Y, seq(0, 1000, 100))), function(x) unique(x))
# Group.1 x #or length(unique(x))
#1 (0,100] 546
#2 (100,200] 169, 154, 168
#3 (200,300] 593, 60
#4 (300,400] 568
#5 (800,900] 85, 83
You can run a for loop over the Y ranges you want and then count the number of unique values of X in each range by converting to a factor and counting its levels:
range <- 100 # bin width, based on example
loops <- ceiling(max(df$Y)/range) # number of bins needed to cover Y
lvlMatrix <- matrix(nrow=0,ncol=2,dimnames=list(NULL,c("range","unique values")))
for(a in seq_len(loops)){
  # rows whose Y falls in ((a-1)*range, a*range]
  sub <- df[df$Y > (a-1)*range & df$Y <= range*a,]
  lvls <- nlevels(factor(sub$X))
  lvlMatrix <- rbind(lvlMatrix,cbind(paste((a-1)*range,"-",range*a,sep=""),lvls))
}
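The same count can be sketched with cut() and tapply(), here applied to the ten example rows from the question. The names bins and counts are hypothetical; note that cut() produces right-closed intervals such as (0,100] and (100,200], which for integer Y matches the 0-100, 101-200 grouping.

```r
# Example rows from the question
df <- data.frame(
  X = c(169, 546, 154, 593, 60, 568, 85, 168, 154, 83),
  Y = c(183, 64, 148, 203, 243, 370, 894, 169, 148, 897)
)

# Assign every Y to a 100-wide bin up to 3000
bins <- cut(df$Y, breaks = seq(0, 3000, by = 100))

# Number of distinct X values per bin; bins with no rows come back as NA
counts <- tapply(df$X, bins, function(x) length(unique(x)))
counts <- counts[!is.na(counts)]  # keep only the bins that contain data
counts
```

The duplicated row (154, 148) is counted once, so the (100,200] bin reports 3 unique values.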
