R - Creating Scatter Plot from Data Frame

I've got a data frame "all" that looks like this:
http://pastebin.com/Xc1HEYyH
Now I want to create a scatter plot with the column headings on the x-axis and the respective values as the data points. For example:
7| x
6| x x
5| x x x x
4| x x x
3| x x
2| x x
1|
---------------------------------------
STM STM STM PIC PIC PIC
cold normal hot cold normal hot
This should be easy, but I cannot figure out how.
Regards

The basic idea, if you want to plot using Hadley's ggplot2, is to get your data into the form:
x y
col_names values
This can be done with the melt function from Hadley's reshape2. Run ?melt to see the possible arguments. Here, since we want to melt the whole data.frame, we just need:
melt(all)
# this gives the data in format:
# variable value
# 1 STM_cold 6.0
# 2 STM_cold 6.0
# 3 STM_cold 5.9
# 4 STM_cold 6.1
# 5 STM_cold 5.5
# 6 STM_cold 5.6
Here, x will then be the variable column and y the corresponding value column.
require(ggplot2)
require(reshape2)
ggplot(data = melt(all), aes(x=variable, y=value)) +
geom_point(aes(colour=variable))
If you don't want the colours, then just remove aes(colour=variable) inside geom_point so that it becomes geom_point().
Edit: I should probably mention that you could also replace geom_point with geom_jitter, which will give you, well, jittered points:
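For example, a minimal sketch of the jittered variant (same melted data as above):
ggplot(data = melt(all), aes(x=variable, y=value)) +
geom_jitter(aes(colour=variable))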

Here are two options to consider. The first uses dotplot from the "lattice" package:
library(lattice)
dotplot(values ~ ind, data = stack(all))
The second uses dotchart from base R's graphics package. To use the dotchart function, you need to wrap your data.frame in as.matrix:
dotchart(as.matrix(all), labels = "")
Note that the points in this graphic are not "jittered", but rather, presented in the order they were recorded. That is to say, the lowest point is the first record, and the highest point is the last record. If you zoomed into the plot for this example, you would see that you have 16 very faint horizontal lines. Each line represents one row from each column. Thus, if you look at the dots for "STM_cold" or any of the other variables that have NA values, you'll see a few blank lines at the top where there was no data available.
This has its advantages since it might show a trend over time if the values are recorded chronologically, but might also be a disadvantage if there are too many rows in your source data frame.

A bit of a manual version using base R graphics just for fun.
Get the data:
test <- read.table(text="STM_cold STM_normal STM_hot PIC_cold PIC_normal PIC_hot
6.0 6.6 6.3 0.9 1.9 3.2
6.0 6.6 6.5 1.0 2.0 3.2
5.9 6.7 6.5 0.3 1.8 3.2
6.1 6.8 6.6 0.2 1.8 3.8
5.5 6.7 6.2 0.5 1.9 3.3
5.6 6.5 6.5 0.2 1.9 3.5
5.4 6.8 6.5 0.2 1.8 3.7
5.3 6.5 6.2 0.2 2.0 3.5
5.3 6.7 6.5 0.1 1.7 3.6
5.7 6.7 6.5 0.3 1.7 3.6
NA NA NA 0.1 1.8 3.8
NA NA NA 0.2 2.1 4.1
NA NA NA 0.2 1.8 3.3
NA NA NA 0.8 1.7 3.5
NA NA NA 1.7 1.6 4.0
NA NA NA 0.1 1.7 3.7",header=TRUE)
Set up the basic plot:
plot(
  NA,                                   # empty canvas; the points are added below
  ylim=c(0,max(test,na.rm=TRUE)+0.3),
  xlim=c(1-0.1,ncol(test)+0.1),
  xaxt="n",                             # suppress the default x-axis
  ann=FALSE,
  panel.first=grid()
)
axis(1,at=seq_along(test),labels=names(test),lwd=0,lwd.ticks=1)
Plot some points, with some x-axis jittering so they are not printed on top of one another.
invisible(
  mapply(
    points,
    jitter(rep(seq_along(test),each=nrow(test))), # jittered x position per column
    unlist(test),                                 # all values as a single vector
    col=rep(seq_along(test),each=nrow(test)),     # one colour per column
    pch=19
  )
)
Result:
Edit:
Here's an example using alpha transparency on the points and getting rid of the jitter, as discussed in the comments with Ananda.
invisible(
  mapply(
    points,
    rep(seq_along(test),each=nrow(test)), # no jitter this time
    unlist(test),
    col=rgb(0,0,0,0.1),                   # black with low alpha
    pch=15,
    cex=3
  )
)
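With pch=15 and cex=3 the squares are large enough to overlap, so wherever several values coincide the low-alpha black stacks up and the region prints darker, conveying density without any jitter.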

Related

Trying to use a variable as label in ggplots

I'm not sure what's going on here, but when I try to run ggplot, it tells me that u and u1 are not valid lists. Did I enter u and u1 incorrectly, so that it thinks these are functions, did I forget something, or did I enter things wrong into ggplot?
u1 <- function(x,y){max(utilityf1(x))}
utilityc1 <- data.frame("utilityc1" =
u(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20),
c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
utilityc1 <- data.frame("utilityc1" =
u1(c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20),
c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)))
hhcomp <- data.frame(
pqx, pqy, utility, hours, p1qx, p1qy, utilit, utilityc1,
utilityc, u,u1, o, o1, o2
)
library(ggplot2)
ggplot(hhcomp, aes(x=utility, y=consumption))+
coord_cartesian(xlim = c(0, 16) )+
ylim(0,20)+
labs(x = "leisure(hours)",y="counsumption(units)")+
geom_line(aes(x = u, y = consumption))+
geom_line(aes(x = u1, y = consumption))
I'm not sure what else to explain, so any guidance on how to present code on Stack Overflow would be useful. I'm also not sure how much description to include: I should have enough code to be reproducible, but Stack Overflow only allows so much code, so it would be good to know the right amount to add.
I think you may need to read the documentation for ggplot2 and maybe R in general.
data.frame
For starters, a data.frame object is a collection of vectors appended together column-wise. Most of what you have defined as inputs for hhcomp are functions, which cannot be stored as a data.frame. A canonical example of a data frame in R is iris:
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
#3 4.7 3.2 1.3 0.2 setosa
#4 4.6 3.1 1.5 0.2 setosa
#5 5.0 3.6 1.4 0.2 setosa
#6 5.4 3.9 1.7 0.4 setosa
str(iris) #print the structure of an r object
#'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
functions
There is a lot going on with your functions. Nested functions are fine, but it seems as though you are failing to pass all values on. You are probably relying on R's scoping rules to find them, but this makes it ambiguous where values come from.
With the currently defined functions, calling u(1:2,3:4) passes 1:2 to utilityf, but utilityf's y argument is never assigned (though with R's lazy evaluation we reach a different error before R realizes that this value is missing). The next function that gets evaluated in this nest is p1qyf, which is defined as follows:
p1qyf <- function(y){(w1*16)-(w1*x)}
With this definition, it does not matter what you pass to the argument y: it will never be used, and the function will always return the same thing.
#with only the function defined
p1qyf()
#Error in p1qyf() : object 'w1' not found
#defining w1
w1 <- 1.5
p1qyf()
#Error in p1qyf() : object 'x' not found
x <- 10:20
#All variables defined in the function
#can now be found in the global environment
#thus the function can be called with no errors because
#w1 and x are defined somewhere...
p1qyf() #nothing assigned to y
#[1]  9.0  7.5  6.0  4.5  3.0  1.5  0.0 -1.5 -3.0 -4.5 -6.0
p1qyf(y = iris) #a data.frame assigned to y
#[1]  9.0  7.5  6.0  4.5  3.0  1.5  0.0 -1.5 -3.0 -4.5 -6.0
p1qyf(y = foo_bar) #an object that hasn't even been assigned yet
#[1]  9.0  7.5  6.0  4.5  3.0  1.5  0.0 -1.5 -3.0 -4.5 -6.0
I imagine you actually intend to define it this way:
p1qyf <- function(y){(w1*16)-(w1*y)}
#Now what we pass to it affects the output
p1qyf(1:10)
#[1] 22.5 21.0 19.5 18.0 16.5 15.0 13.5 12.0 10.5 9.0
head(p1qyf(iris))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 16.35 18.75 21.90 23.7 NA
#2 16.65 19.50 21.90 23.7 NA
#3 16.95 19.20 22.05 23.7 NA
#4 17.10 19.35 21.75 23.7 NA
#5 16.50 18.60 21.90 23.7 NA
#6 15.90 18.15 21.45 23.4 NA
You can improve this further by defining more arguments so that R doesn't need to search for missing values with its scoping rules:
p1qyf <- function(y, w1 = 1.5){(w1*16)-(w1*y)}
#w1 is defaulted to 1.5 and doesn't need to be searched for.
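For example (a small sketch), the default is used unless you override it explicitly:
p1qyf(1:10, w1 = 2)
#[1] 30 28 26 24 22 20 18 16 14 12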
I would spend some time looking into your functions because they are unclear and some, such as your p1qyf, do not fully use the arguments they are passed.
ggplot
ggplot takes some type of structured data object, such as a data.frame or tbl_df, and allows plotting. The aes mappings can take the symbol names of the column headers you wish to map. Continuing with iris as an example:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species))+
geom_point() +
geom_line()
I hope this helps clear up why you may be getting some errors. Honestly though, even if you were able to declare a data.frame, the problem here is that your post is still not reproducible. Good luck.
pqxf <- function(x){(1)*(y)} # replace 1 with py and assign a value to py
pqyf <- function(y){(w * 16)-(w * x)} #
utilityf <- function(x, y) { (pqyf(x)) * ((pqxf(y)))} # the utility function C,l
hours <- c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,20)
w1 <- 1.5
p1qxf <- function(x){(1)*(y)} # replace 1 with py and assign a value to p1y
p1qyf <- function(y){(w1 * 16)-(w1 * x)} #
utilityf1 <- function(x, y) { (p1qyf(x)) * ((p1qxf(y)))} # the utility function (C,l)
utilitycf <- function(x,y){max(utilityf(x))/((pqyf(y)))}
utilityc1f <- function(x,y){max(utilityf1(x))/((pqyf(y)))}
u <- function(x,y){max(utilityf(x))}
u1 <- function(x,y){max(utilityf1(x))}

Plotting sales over time in R

I am trying to show the top 100 sales on a scatterplot by year. I used the below code to take top 100 games according to sales and then set it as a data frame.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
as.data.frame(top100)
I then tried to plot this with the below code:
ggplot(top100)+
aes(x=Year, y = Global_Sales) +
geom_point()
I get the below error when using the subset top100:
Error: data must be a data frame, or other object coercible by fortify(), not a numeric vector
If I use the actual games dataset, I get the plot attached.
Any ideas?
As pointed out in comments by @CMichael, you have several issues in your code.
In the absence of a reproducible example, I used the iris dataset to explain what is wrong with your code.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
By doing that, you are extracting only a single column.
The same command with the iris dataset:
> head(sort(iris$Sepal.Length, decreasing = TRUE), n = 20)
[1] 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 7.2 7.1 7.0 6.9 6.9 6.9 6.9 6.8 6.8 6.8
So, first, you no longer have two dimensions to plot in ggplot2. Second, the column names are not kept during the extraction, so you can't afterwards ask ggplot2 to plot Year and Global_Sales.
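You can verify this with str(): the extraction is a bare numeric vector, not a data.frame:
str(head(sort(iris$Sepal.Length, decreasing = TRUE), n = 20))
# num [1:20] 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...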
To solve your issue, you can do the following (here with the iris dataset):
top100 = as.data.frame(head(iris[order(iris$Sepal.Length, decreasing = TRUE), 1:2], n = 100))
And you get a data.frame of this type:
> str(top100)
'data.frame': 100 obs. of 2 variables:
$ Sepal.Length: num 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...
$ Sepal.Width : num 3.8 3.8 2.6 2.8 3 3 2.8 2.9 3.6 3.2 ...
> head(top100)
Sepal.Length Sepal.Width
132 7.9 3.8
118 7.7 3.8
119 7.7 2.6
123 7.7 2.8
136 7.7 3.0
106 7.6 3.0
And then you can plot it:
library(ggplot2)
ggplot(top100, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
Warning: based on what you provided in your example, I suggest you do:
top100 <- as.data.frame(head(games[order(games$NA_Sales,decreasing=TRUE),c("Year","Global_Sales")], 100))
However, if this is not satisfying, you should consider providing a reproducible example of your dataset; see How to make a great R reproducible example.

R: Free hand selection of data points in scatter plots

I would like to know if there is a good way to get the IDs of the points in a scatter plot by drawing a free-hand polygon in R.
I found scatterD3 and it looks nice, but I can't manage to output the lab to a variable in R.
Thank you.
Roc
Here's one way
library(iplots)
with(iris, iplot(Sepal.Width,Petal.Width))
Use SHIFT (xor) or SHIFT+ALT (and) to select points (red):
Then:
iris[iset.selected(), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 119 7.7 2.6 6.9 2.3 virginica
# 115 5.8 2.8 5.1 2.4 virginica
# 133 6.4 2.8 5.6 2.2 virginica
# 136 7.7 3.0 6.1 2.3 virginica
# 146 6.7 3.0 5.2 2.3 virginica
# 142 6.9 3.1 5.1 2.3 virginica
gives you the selected rows.
The package "gatepoints", available on CRAN, will allow you to draw a gate returning your points of interest. It can be used as follows.
First, plot your points:
x <- data.frame(x=1:10, y=1:10)
plot(x, col = "red", pch = 16)
Then select your points by running the following command:
selectedPoints <- fhs(x)
This will return:
selectedPoints
#> [1] "4" "5" "7"
#> attr(,"gate")
#> x y
#> 1 6.099191 8.274120
#> 2 8.129107 7.048649
#> 3 8.526881 5.859404
#> 4 5.700760 6.716428
#> 5 5.605314 5.953430
#> 6 6.866882 3.764390
#> 7 3.313575 3.344069
#> 8 2.417270 5.217868
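Since fhs() returns the row names of the points inside the gate, you can use them directly to subset the original data:
x[selectedPoints, ]
#>   x y
#> 4 4 4
#> 5 5 5
#> 7 7 7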

R - Reduce dimensionality for LSA

I am following an example of svd, but I still don't know how to reduce the dimension of the final matrix:
a <- round(runif(10)*100)
dat <- as.matrix(iris[a,-5])
rownames(dat) <- c(1:10)
s <- svd(dat)
pc.use <- 1
recon <- s$u[,pc.use] %*% diag(s$d[pc.use], length(pc.use), length(pc.use)) %*% t(s$v[,pc.use])
But recon still has the same dimensions. I need to use this for semantic analysis.
The code you provided does not reduce the dimensionality. Instead, it takes the first principal component from your data, removes the rest of the principal components, and then reconstructs the data with only one PC.
You can check that this is happening by inspecting the rank of the final matrix:
library(Matrix)
rankMatrix(dat)
as.numeric(rankMatrix(dat))
[1] 4
as.numeric(rankMatrix(recon))
[1] 1
If you want to reduce dimensionality (the number of rows), you can select some principal components and compute the scores of your data on those components instead.
But first, let's make some things clear about your data: it seems you have 10 samples (rows) with 4 features (columns). Dimensionality reduction will reduce the 4 features to a smaller set of features.
So you can start by transposing your matrix for svd():
dat <- t(dat)
dat
               1   2   3   4   5   6   7   8   9  10
Sepal.Length 6.7 6.1 5.8 5.1 6.1 5.1 4.8 5.2 6.1 5.7
Sepal.Width  3.1 2.8 4.0 3.8 3.0 3.7 3.0 4.1 2.8 3.8
Petal.Length 4.4 4.0 1.2 1.5 4.6 1.5 1.4 1.5 4.7 1.7
Petal.Width  1.4 1.3 0.2 0.3 1.4 0.4 0.1 0.1 1.2 0.3
Now you can repeat the svd. Centering the data before this procedure is advisable:
s <- svd(dat - rowMeans(dat))
Scores on the principal components are obtained by projecting your (centered) data onto the left singular vectors:
PCs <- t(s$u) %*% (dat - rowMeans(dat)) # project the centered data, consistent with the svd above
Now, if you want to reduce dimensionality by eliminating PCs with low variance, you can do so like this:
dat2 <- PCs[1:2,] # would select first two PCs.
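If you are unsure how many PCs to keep, one common heuristic (a sketch, not part of the original question) is to look at the proportion of variance each singular value captures:
var_explained <- s$d^2 / sum(s$d^2) # variance captured by each PC
cumsum(var_explained)               # keep enough PCs to cover, say, 95%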

Quickest way to read a subset of rows of a CSV

I have a 5GB CSV with 2 million rows. The header is a row of comma-separated strings, and each row is comma-separated doubles with no missing or corrupted data. It is rectangular.
My objective is to read a random 10% (with or without replacement, doesn't matter) of the rows into RAM as fast as possible. An example of a slow solution (but faster than read.csv) is to read in the whole matrix with fread and then keep a random 10% of the rows.
require(data.table)
X <- data.matrix(fread('/home/user/test.csv')) #reads full data.matrix
X <- X[sample(1:nrow(X))[1:round(nrow(X)/10)],] #sample random 10%
However I'm looking for the fastest possible solution (this is slow because I need to read the whole thing first, then trim it after).
The solution deserving of a bounty will give system.time() estimates of different alternatives.
Other:
I am using Linux
I don't need exactly 10% of the rows. Just approximately 10%.
I think this should work pretty quickly, but let me know since I have not tried with big data yet.
write.csv(iris,"iris.csv")
fread("shuf -n 5 iris.csv")
V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica
This takes a random sample of N=5 for the iris dataset.
To avoid the chance of using the header row again, this might be a useful modification:
fread("tail -n+2 iris.csv | shuf -n 5", header=FALSE)
Here's a file with 100000 lines in it like this:
"","a","b","c"
"1",0.825049088569358,0.556148858508095,0.591679535107687
"2",0.161556158447638,0.250450366642326,0.575034103123471
"3",0.676798462402076,0.0854280597995967,0.842135070590302
"4",0.650981109589338,0.204736212035641,0.456373531138524
"5",0.51552157686092,0.420454133534804,0.12279288447462
$ wc -l d.csv
100001 d.csv
So that's 100000 lines plus a header. We want to keep the header and sample each line if a random number from 0 to 1 is greater than 0.9.
$ awk 'NR==1 {print} ; rand()>.9 {print}' < d.csv >sample.csv
check:
$ head sample.csv
"","a","b","c"
"12",0.732729186303914,0.744814146542922,0.199768838472664
"35",0.00979996216483414,0.633388962829486,0.364802648313344
"36",0.927218825090677,0.730419414117932,0.522808947600424
"42",0.383301998255774,0.349473554175347,0.311060158303007
and it has 10027 lines:
$ wc -l sample.csv
10027 sample.csv
This took 0.033s of real time on my four-year-old box; the HD speed is probably the limiting factor here. It should scale linearly, since the file is dealt with strictly line by line.
You then read in sample.csv using read.csv or fread as desired:
> s = fread("sample.csv")
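Since the question asks for system.time() estimates, here is a sketch of how you could time the whole awk-plus-fread pipeline from within R (no timings claimed; d.csv and sample.csv are the files from above):
system.time({
  system("awk 'NR==1 {print} ; rand()>.9 {print}' < d.csv > sample.csv") # shell sampling step
  s <- fread("sample.csv")                                               # read the sample back in
})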
You could use sqldf::read.csv.sql and an SQL command to pull the data in:
library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE) # write a csv file to test with
read.csv.sql("iris.csv","SELECT * FROM file ORDER BY RANDOM() LIMIT 10")
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 6.3 2.8 5.1 1.5 virginica
2 4.6 3.1 1.5 0.2 setosa
3 5.4 3.9 1.7 0.4 setosa
4 4.9 3.0 1.4 0.2 setosa
5 5.9 3.0 4.2 1.5 versicolor
6 6.6 2.9 4.6 1.3 versicolor
7 4.3 3.0 1.1 0.1 setosa
8 4.8 3.4 1.9 0.2 setosa
9 6.7 3.3 5.7 2.5 virginica
10 5.9 3.2 4.8 1.8 versicolor
It doesn't calculate the 10% for you, but you can choose the absolute limit of rows to return.
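If you do want roughly 10%, a small sketch (assuming a Unix wc is available, as the question says Linux) is to count the data rows first and compute the limit yourself:
n <- as.integer(strsplit(trimws(system("wc -l iris.csv", intern = TRUE)), "\\s+")[[1]][1]) - 1 # rows minus header
read.csv.sql("iris.csv", sprintf("SELECT * FROM file ORDER BY RANDOM() LIMIT %d", round(n / 10)))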
