R ggplot2 number of rows of the same values in a column - r

I'm new to R and plotting in R. This might be a very simple question but here it is,
Suppose I have a data frame like this:
a b c d
1 5 6 7
2 3 5 7
1 4 6 2
2 3 5 NA
1 4 4 2
2 2 4 2
1 2 5 1
2 3 4 NA
Here a, b, c, d are column names. I want to plot a bar chart that has the values in column d on the x axis, and the number of rows with that value on the y axis. So 7 has 2 rows, 1 has 1 and 2 has 3. It's not important to include the missing values in between (3, 4, 5, 6).
So the result would be something like a histogram. I know I can do counting on column d and then do the plotting but I feel there must be a better way to do this.

Here's an approach--if I understand your question, columns A, B, and C are immaterial to what you are doing, which is plotting frequencies of column D.
library(ggplot2)
library(reshape)
##get frequencies of col d
test.summary<-table(test$d)
## re-shape the data
test.summary.m<-melt(test.summary)
ggplot(test.summary.m,aes(x=as.factor(Var.1),y=value))+
geom_bar(stat='identity')
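For comparison, here is a minimal sketch that skips the separate counting step. geom_bar() counts rows per x value by default (stat = "count"), so the table/melt step is optional. The data frame name `test` comes from the answer above; its contents are retyped from the question, and dropping the NA rows first is an assumption about what the asker wants. The ggplot2 part only runs if the package is installed.

```r
# Data retyped from the question; `test` is the name used in the answer above.
test <- data.frame(
  a = c(1, 2, 1, 2, 1, 2, 1, 2),
  b = c(5, 3, 4, 3, 4, 2, 2, 3),
  c = c(6, 5, 6, 5, 4, 4, 5, 4),
  d = c(7, 7, 2, NA, 2, 2, 1, NA)
)

# Drop rows where d is missing (assumption: NA should not get its own bar).
test_complete <- test[!is.na(test$d), ]

# Base R check of the counts the bars should show.
counts <- table(test_complete$d)
print(counts)

# geom_bar() uses stat = "count" by default, so no pre-aggregation is needed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(test_complete, aes(x = factor(d))) +
    geom_bar() +
    labs(x = "d", y = "number of rows")
  print(p)
}
```

Wrapping d in factor() keeps the x axis to the values that actually occur (1, 2, 7) instead of a continuous scale with gaps.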

Related

Removing rows from a dataset based on conditional statement across factors

I am struggling to figure out how to remove rows from a dataset based on conditions across multiple factors in a large dataset. Here is some example data to illustrate the problem I am having with a smaller data frame:
Code<-c("A","B","C","D","C","D","A","A")
Value<-c(1, 2, 3, 4, 1, 2, 3, 4)
data<-data.frame(cbind(Code, Value))
data$Value <- (as.numeric(data$Value))
data
Code Value
1 A 1
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I want to remove values where the Code is A and the Value is < 2 from the dataset. I understand the logic of how to select for values where Code is A and Values <2, but I can't figure out how to remove these values from the dataset without also removing all values of A that are > 2, while maintaining values of the other codes that are less than 2.
#Easy to select for values of A less than 2
data2<- subset(data, (Code == "A" & Value < 2))
data2
Code Value
1 A 1
#But I want to remove values of A less than 2 without also removing values of A that are greater than 2:
data1<- subset(data, (Code != "A" & Value > 2))
data1
Code Value
3 C 3
4 D 4
### just using Value > 2 does not allow me to include values that are less than 2 for the other Codes (B,C,D):
data2<- subset(data, Value > 2)
data2
Code Value
3 C 3
4 D 4
7 A 3
8 A 4
My ideal dataset would look like this:
data
Code Value
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I have tried different iterations of filter(), subset(), and select() but I can't figure out the correct conditional statement that allows me to remove the desired combination of levels of multiple factors. Any suggestions would be greatly appreciated.
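One possible way to do this in base R, offered as a sketch rather than the only answer: negate the combined condition, so that only rows that are both Code "A" and Value < 2 are dropped. The names `cleaned` and `cleaned_alt` are made up for illustration, and the data frame is built without cbind() so that Value stays numeric.

```r
# Example data from the question, built without cbind() so Value stays numeric.
data <- data.frame(
  Code  = c("A", "B", "C", "D", "C", "D", "A", "A"),
  Value = c(1, 2, 3, 4, 1, 2, 3, 4)
)

# Negate the combined condition: drop only rows that are both "A" and < 2.
cleaned <- subset(data, !(Code == "A" & Value < 2))

# Equivalent form by De Morgan's law: keep rows that are not "A" OR are >= 2.
cleaned_alt <- subset(data, Code != "A" | Value >= 2)

print(cleaned)
```

The attempt with `Code != "A" & Value > 2` fails because `&` requires both parts to hold, which removes every "A" row and every small value; the condition to keep is the negation of the condition to remove.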

R plotly: Customize x-axis values in box plot

I have a data frame with 3 variables and 260 rows. (Sample below)
HouseID<-c(1:10)
Town<-c("D","A","B","C","A","B","C","C","C","A")
Occupants<-c(5,3,2,4,5,2,3,8,1,3)
df<-data.frame(HouseID,Town,Occupants)
HouseID Town Occupants
1 D 5
2 A 3
3 B 2
4 C 4
5 A 5
6 B 2
7 C 3
8 C 8
9 C 1
10 A 3
I want to create a box plot for the distribution of Occupants with the order of x-axis based on the descending order of frequencies of Towns
Town Freq
A 3
B 2
C 4
D 1
(Shown a sample image)
I tried sorting the data frame, but still, the box plot x-axis is displayed based on alphabetical order by default. Is there a way I could do this?
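You simply have to use factor() to reorder the levels of df$Town according to their counts, which summary(df$Town) returns when Town is a factor. That was the data.frame() default before R 4.0; in R 4.0 or later, convert it first with df$Town <- factor(df$Town):
library(plotly)
df$Town <- factor(df$Town, levels(df$Town)[order(summary(df$Town),decreasing = TRUE)])
plot_ly(df, x=~Town, y=~Occupants, type="box")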
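A variant of the same idea that does not depend on Town already being a factor, sketched with table() instead of summary(). The variable names `town_order` and `fig` are made up for illustration, and the plotly part only runs if the package is installed.

```r
# Sample data from the question.
df <- data.frame(
  HouseID   = 1:10,
  Town      = c("D", "A", "B", "C", "A", "B", "C", "C", "C", "A"),
  Occupants = c(5, 3, 2, 4, 5, 2, 3, 8, 1, 3)
)

# Count rows per town, sort descending, and use the names as factor levels.
town_order <- names(sort(table(df$Town), decreasing = TRUE))
df$Town <- factor(df$Town, levels = town_order)
print(levels(df$Town))

# plotly orders a categorical axis by factor levels; print(fig) displays the plot.
if (requireNamespace("plotly", quietly = TRUE)) {
  library(plotly)
  fig <- plot_ly(df, x = ~Town, y = ~Occupants, type = "box")
}
```

Sorting the rows of the data frame has no effect here because the axis order comes from the factor levels, not from the row order.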

Extract data from data.frame based on coordinates in another data.frame

So here is my problem. I have a really big data.frame with two columns: the first holds x coordinates (rows) and the second y coordinates (columns), for example:
x y
1 1
2 3
3 1
4 2
3 4
In another frame I have some data (numbers actually):
a b c d
8 7 8 1
1 2 3 4
5 4 7 8
7 8 9 7
1 5 2 3
I would like to add a third column in first data.frame with data from second data.frame based on coordinates from first data.frame. So the result should look like this:
x y z
1 1 8
2 3 3
3 1 5
4 2 8
3 4 8
Since my data.frames are really big, for loops are too slow. I think there is a way to do this with the apply family of functions, but I can't find how. Thanks in advance (and sorry for the ugly message layout; this is my first post here and I don't know how to produce the nice layout with code and proper data.frames that I see in other questions).
This is a simple indexing question. No need for external packages or *apply loops, just do
df1$z <- df2[as.matrix(df1)]
df1
# x y z
# 1 1 1 8
# 2 2 3 3
# 3 3 1 5
# 4 4 2 8
# 5 3 4 8
A base R solution: (df1 and df2 are coordinates and numbers as data frames):
df1$z <- mapply(function(x,y) df2[x,y], df1$x, df1$y )
It works if the last y in the first data frame is corrected from 5 to 4.
I guess it was a typo, since you don't have 5 columns in the second data frame.
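To make the two base R answers above easy to try, here is a self-contained sketch that rebuilds the example data from the question and checks that matrix indexing and mapply() give the same result. The intermediate names `z_matrix` and `z_mapply` are made up for illustration.

```r
# Coordinates and values retyped from the question.
df1 <- data.frame(x = c(1, 2, 3, 4, 3), y = c(1, 3, 1, 2, 4))
df2 <- data.frame(
  a = c(8, 1, 5, 7, 1),
  b = c(7, 2, 4, 8, 5),
  c = c(8, 3, 7, 9, 2),
  d = c(1, 4, 8, 7, 3)
)

# Matrix indexing: each row of the two-column matrix is one (row, column) pair.
z_matrix <- df2[as.matrix(df1)]

# Element-wise lookup with mapply(), as in the second answer.
z_mapply <- mapply(function(x, y) df2[x, y], df1$x, df1$y)

df1$z <- z_matrix
print(df1)
```

The matrix form does a single vectorised lookup, while mapply() calls the indexing function once per row, so the former is the one to prefer on large data.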
Here's how I would do this.
First, use data.table for fast merging; then convert your data frames (I'll call them dt1 with coordinates and vals with values) to data.tables.
library(data.table)
dt1<-data.table(dt1)
vals<-data.table(vals)
Second, put vals into a new data.table with coordinates:
## as.matrix() flattens column by column, matching the x/y layout below
vals_dt<-data.table(x=rep(1:dim(vals)[1],dim(vals)[2]),
y=rep(1:dim(vals)[2],each=dim(vals)[1]),
z=as.vector(as.matrix(vals)),key=c("x","y"))
Now merge:
## i.z refers to column z of vals_dt
setkey(dt1,x,y)[vals_dt,z:=i.z]
You can also try the data.table package and update df1 by reference
library(data.table)
setDT(df1)[, z := df2[cbind(x, y)]][]
# x y z
# 1: 1 1 8
# 2: 2 3 3
# 3: 3 1 5
# 4: 4 2 8
# 5: 3 4 8

R combine nx4 into nx2

I have a dataset with one factor (4 levels). However, each factor level and its data is currently in its own column, with the factor level label at the top (a matrix of n by 4).
To do an ANOVA I want to change this to an n by 2 layout, with all the factor labels in column A and all the data in column B.
I could easily cut and paste this in Excel and then back into a CSV, but I assume there is a way to do this with cbind.
Sample data:
A B C D
2 4 6 8
3 5 7 9
What I require:
A 2
A 3
B 4
B 5
C 6
C 7
D 8
D 9
You should use stack:
stack(df) # where `df` is your data.frame
stack is better here but also:
library(reshape2)
melt(df)
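A short sketch of what stack() produces on the sample data, in case the column order matters. stack() names its output columns `values` and `ind`; the name `long` and the reordering step are additions for illustration, to match the label-first layout asked for in the question.

```r
# Sample data from the question.
df <- data.frame(A = c(2, 3), B = c(4, 5), C = c(6, 7), D = c(8, 9))

# stack() returns two columns: `values` (the data) and `ind` (the source column).
long <- stack(df)

# Put the factor label first to match the layout asked for in the question.
long <- long[, c("ind", "values")]
print(long)

# The long format can go straight into a one-way ANOVA.
fit <- aov(values ~ ind, data = long)
```

Because `ind` is already a factor, no further conversion is needed before calling aov().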

how to plot overlay multiple time series given condition(s) in lattice?

Suppose I have a data frame, df, that looks like:
f t1 t2 t3
h 1 3 4
h 2 4 3
t 3 4 5
t 5 6 8
with f being a factor and $t attributes being numerical values related to time ordered events.
I could overlay time series t1 to t3 using par(new=T) and isolate by factor manually.
But I wonder if there is some way to do this with lattice, where the overlaid time series
are conditioned by the factor. So we would have two panels, with overlaid time series corresponding to conditional factors, f. Most examples I've seen only use one time series (vector) per factor. I also thought about using a parallel plot, but time information is lost.
I've also tried something like
xyplot(df$t1+df$t2+df$t3 ~seq(3) | factor(df$f))
, but it loses row sequence connections. Anyone know if this is possible?
Here's a very crude illustration using non lattice approach.
x<-matrix(seq(12),4,3)
f<-c('a','a','b','b')
df<-data.frame(f,x)
layout(1:2); yr<-c(0,12); xr<-c(1,3);
plot(as.numeric(df[1,2:4])~seq(3),type='o',ylim=yr,xlim=xr,ylab='A')
par(new=T)
plot(as.numeric(df[2,2:4])~seq(3),type='o',ylim=yr,xlim=xr,ylab='A')
plot(as.numeric(df[3,2:4])~seq(3), type='o',ylim=yr,xlim=xr,ylab='B')
par(new=T)
plot(as.numeric(df[4,2:4])~seq(3),type='o',ylim=yr,xlim=xr,ylab='B')
I added an ID variable and melted with package:reshape2
dat
f t1 t2 t3 ID
1 h 1 3 4 1
2 h 2 4 3 2
3 t 3 4 5 3
4 t 5 6 8 4
datm <- melt(dat, id.vars=c("ID","f"), measure.vars=c("t1", "t2", "t3"))
> datm
ID f variable value
1 1 h t1 1
2 2 h t1 2
3 3 t t1 3
4 4 t t1 5
5 1 h t2 3
6 2 h t2 4
7 3 t t2 4
8 4 t t2 6
9 1 h t3 4
10 2 h t3 3
11 3 t t3 5
12 4 t t3 8
Since you asked to have it "overlayed" I used the group parameter to keep the ID's separate and the "|" operator to give you the two panels for "h" and "t":
xyplot(value~variable|f, group=ID, data=datm, type="b")
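The answer above does not show how the ID column was added, so here is a self-contained sketch of the whole sequence under one assumption: that ID is simply the row number. The reshape2 and lattice steps only run if both packages are installed.

```r
# Data retyped from the question.
dat <- data.frame(
  f  = c("h", "h", "t", "t"),
  t1 = c(1, 2, 3, 5),
  t2 = c(3, 4, 4, 6),
  t3 = c(4, 3, 5, 8)
)

# One ID per original row (assumption), so each row becomes its own line.
dat$ID <- seq_len(nrow(dat))

if (requireNamespace("reshape2", quietly = TRUE) &&
    requireNamespace("lattice", quietly = TRUE)) {
  library(reshape2)
  library(lattice)
  # Long format: one row per (ID, time point) pair.
  datm <- melt(dat, id.vars = c("ID", "f"), measure.vars = c("t1", "t2", "t3"))
  # groups = ID keeps rows separate; | f gives one panel per factor level.
  print(xyplot(value ~ variable | f, groups = ID, data = datm, type = "b"))
}
```

Without the ID grouping, lattice would join all points within a panel into a single line, which is the loss of "row sequence connections" described in the question.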
(1) This can be done compactly using xyplot.zoo . The first statement converts the data frame to a zoo series (series are stored in columns in zoo objects) and the second statement plots it such that the screen argument defines which panel each series is shown in:
library(zoo)
library(lattice)
z <- zoo(t(df[-1]))
xyplot(z, screen = df$f, type = "o")
(2) or if it were desired to show df's column names on the X axis instead then define z as the following (and then issue the xyplot command above):
z <- zoo(t(df[-1]), factor(names(df[-1])))
xyplot using the z in the first point looks like this (and the second is the same except for the X axis labels):
EDIT: simplified (2)
