R boxplot ggplot issues - r

I am new to R and am trying to do some graphics using ggplot and a bit of reverse engineering. I have a data frame as follows:
> data
experiments percentages
1 A 72.11538
2 A 90.62500
3 A 91.52542
4 B 94.81132
5 B 96.95122
6 B 98.95833
7 C 83.75000
8 C 84.84848
9 C 91.12903
Because A and B are similar experiments, I do the following:
data$experiments[data$experiments == "B"] = "A"
If I do now
ggplot(data, aes(x = experiments, y = percentages)) + geom_boxplot()
I get one box for A, one for C but still I get a label for B!
Is there any way of getting rid of B on the X axis?
Thanks a lot for your help

I'm guessing that experiments in data is a factor. If you run str(data), I expect you will see that experiments is a factor with 3 levels: A, B, and C. In R versions before 4.0.0, strings were turned into factors by default when a data frame was created (stringsAsFactors = TRUE); since 4.0.0 the default is FALSE.
The idea of factors is that they represent a set of possible values, even if not all the possibilities are in the actual data. There are two ways to fix this.
Convert the column to a character vector:
data$experiments <- as.character(data$experiments)
Or remove the unused level from the factor:
data$experiments <- droplevels(data$experiments)
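As a quick check, here is a minimal sketch that rebuilds the data from the question, assuming experiments is stored as a factor (that is my guess, not something the question confirms), and shows the unused level before and after dropping it:

```r
# Rebuild the question's data, assuming 'experiments' is a factor
data <- data.frame(
  experiments = factor(rep(c("A", "B", "C"), each = 3)),
  percentages = c(72.11538, 90.62500, 91.52542,
                  94.81132, 96.95122, 98.95833,
                  83.75000, 84.84848, 91.12903)
)

# Recode B as A; the values change but the level "B" stays registered
data$experiments[data$experiments == "B"] <- "A"
levels_before <- levels(data$experiments)

# Drop levels that no longer occur in the data
data$experiments <- droplevels(data$experiments)
levels_after <- levels(data$experiments)

print(levels_before)
print(levels_after)
```

After this, the boxplot call from the question should only have the levels that actually occur to draw on the x axis.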

Related

Creating a variable with randomized number from an old variable (with a higher population)

I am not very good at R, and I have a problem. I want to run a linear regression between two variables from different datasets, but I run into the problem that one dataset is much bigger than the other. To get around this, I want to create a smaller variable with the same number of observations, randomly selected from the larger dataset's variable. What is the command for that? If any further detail is needed, please let me know. Thank you so much for your help!
I tried to fit a linear regression using the two datasets, but because one is bigger than the other it did not work, and the following error appeared:
Error in model.frame.default(formula = lobby_expenditure$expend ~ compustat$lct, :
variable lengths differ (found for 'compustat$lct')
Here is a simple example: y comes from d2, and a random sample of rows from d1 is selected for x.
d1=data.frame(x=rnorm(100))
d2=data.frame(y=rnorm(10))
lm(d2$y~d1[sample(1:nrow(d1),nrow(d2)),"x"])
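The same idea can be written so that the draw is reproducible and the sampled rows are kept for later inspection. This is only a sketch: the seed and the names idx, x_sub and fit are my own choices, not anything from the question. Whether pairing randomly chosen rows from unrelated datasets is statistically meaningful depends on your data.

```r
set.seed(1)                         # arbitrary seed, only for reproducibility

d1 <- data.frame(x = rnorm(100))    # larger dataset
d2 <- data.frame(y = rnorm(10))     # smaller dataset

# Draw as many row indices from d1 as there are rows in d2 (without replacement)
idx <- sample(nrow(d1), nrow(d2))
x_sub <- d1$x[idx]

# Both vectors now have the same length, so lm() no longer complains
fit <- lm(d2$y ~ x_sub)
print(coef(fit))
```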
To get a random sample of rows, use dplyr::sample_n.
Example dataset:
df2 <- read_table('Individual Site
1 A
2 B
3 A
4 C
5 C
6 B
7 A
8 B
9 C')
With sample_n(df2, 2), where 2 is the number of rows you want, you get random rows. Your output may differ from the following since the selection is random.
#A tibble: 2 x 2
Individual Site
<dbl> <chr>
1 4 C
2 5 C

Columns in data.frame appear empty when using head(data.frame), but when using levels(data.frame$column1) appear to have values

I have a large data frame that includes names of sites, their latitude, and the longitude, among other data. When I write head(data.frame), the column with the site names looks correct, but the latitude and longitude columns are empty (there are no values at all in them). However, when I write levels(data.frame$longitude), all of the values for the longitudes of my site appear. The same issue is occurring with latitude.
I am wondering why these values don't appear when I look up the head of the data.frame, but do appear when I look up the level?
Many thanks!
levels() returns the levels associated with a factor. Not all values need to exist. For example, if you do
x <- factor(letters)
x[1:3]
you'll see
[1] a b c
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
because all 26 levels are still associated with the factor, even though only 3 of them are observed.
In your case, it's possible that one of the levels of your factor is a blank string. You can get this using
x <- factor("", levels = c("", letters))
or by reading in data containing blanks in some columns.
To find out all the values in a column, don't use levels(x), use unique(x). That will still show all levels if x is a factor, but it will also show which values actually occur in the column.
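To make the difference concrete, here is a small sketch with a made-up factor (the values are hypothetical, not taken from your data) where a blank string is a level and one declared level never occurs:

```r
# Hypothetical column: two blanks, one real value, and a level that never occurs
x <- factor(c("", "", "12.5"), levels = c("", "12.5", "13.1"))

all_levels <- levels(x)                # every declared level, used or not
seen <- as.character(unique(x))        # only the values that actually occur
counts <- table(x)                     # how often each level occurs

print(all_levels)
print(seen)
print(counts)
```

Here levels(x) lists "13.1" although it never appears, while unique(x) and table(x) show that the column really holds only blanks and "12.5".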

How to plot recurrencies in R

How can I plot a recurrence in R?
Any solution with base plot, ggplot2, lattice, or a dedicated package is welcome.
For example:
Imagine I have these data:
mydata <- data.frame(t=1:10, Y=runif(10))
t Y
1 0.3744869
2 0.6314202
3 0.3900789
4 0.6896278
5 0.6894134
6 0.5549006
7 0.4296244
8 0.4527201
9 0.3064433
10 0.5783539
I could transform it like this:
mydata2 <- data.frame(t=c(NA,mydata$t),Y=c(NA,mydata$Y),Y2=c(mydata$Y, NA))
t Y Y2
NA NA 0.9103703
1 0.9103703 0.1426041
2 0.1426041 0.4150476
3 0.4150476 0.2109258
4 0.2109258 0.4287504
5 0.4287504 0.1326900
6 0.1326900 0.4600964
7 0.4600964 0.9429571
8 0.9429571 0.7619739
9 0.7619739 0.9329098
10 0.9329098 NA
(or similar methods, but I can have problems with missing data)
And plot it
plot(Y2~Y, data=mydata2)
I guess I must use some grouping function such as ave or apply. But it's not an elegant solution, and if I have more columns it can become difficult to generalize the transformation.
For example
mydata3 <- data.frame(x=sample(10,100, replace=T),t=1:100, Y=2*runif(100)+1)
For every x (or combination of values on other columns) I want to plot Y_{i+1} ~ Y_i, on the same plot.
Other tools, such as Mathematica have functions to plot sequences directly.
I've found a solution, though not a very elegant one.
For this sample data:
mydata <- data.frame(x=sample(4,25, replace=T),t=1:25, Y=2*runif(25)+1)
newdata <- mydata[order(mydata$x, mydata$t), ]
newdata$prev <- ave(newdata$Y, newdata$x, FUN=function(x) c(NA,head(x,-1)))
plot(Y~prev, data=newdata)
If, as in this example, a group does not have a row for every t value, you would first need to generate NAs for the missing values. But it's just a quick solution. In my real data I have many observations for each t.
lag.plot can plot recurrence plots but not within each subgroup.
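The ave approach above can be wrapped in a small helper so it generalizes to other column names. This is only a sketch: lag_within is a hypothetical name of my own, and like the code above it assumes that consecutive rows within a group are consecutive time steps.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
mydata <- data.frame(x = sample(4, 25, replace = TRUE),
                     t = 1:25,
                     Y = 2 * runif(25) + 1)

# Hypothetical helper: add a 'prev' column holding the previous value
# of `value` within each `group`, after ordering by `group` and `time`
lag_within <- function(df, value, group, time) {
  df <- df[order(df[[group]], df[[time]]), ]
  df$prev <- ave(df[[value]], df[[group]],
                 FUN = function(v) c(NA, head(v, -1)))
  df
}

newdata <- lag_within(mydata, "Y", "x", "t")

# One point per consecutive pair, coloured by group
plot(Y ~ prev, data = newdata, col = newdata$x, pch = 19)
```

The first row of each group has prev set to NA, so those rows are simply left out of the plot.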

Simple line plot using R ggplot2

I have data in .csv format as shown below. I am new to ggplot2 and have not been able to plot it.
T L
141.5453333 1
148.7116667 1
154.7373333 1
228.2396667 1
148.4423333 1
131.3893333 1
139.2673333 1
140.5556667 2
143.719 2
214.3326667 2
134.4513333 3
169.309 8
161.1313333 4
I tried to plot a line graph using the following code:
data<-read.csv("sample.csv",head=TRUE,sep=",")
ggplot(data,aes(T,L))+geom_line()
but I got the following image, which is not what I want.
I want an image like the following:
Can anybody help me?
You want to use a variable for the x-axis that has lots of duplicated values and expect the software to guess that the order you want those points plotted is given by the order they appear in the data set. This also means the values of the variable for the x-axis no longer correspond to the actual coordinates in the coordinate system you're plotting in, i.e., you want to map a value of "L=1" to different locations on the x-axis depending on where it appears in your data.
This kind of mapping does not work in ggplot2 out of the box. You have to define a separate variable that has a proper mapping to values on the x-axis ("id" in the code below) and then overwrite the labels with the values for "L".
The code below shows you how to do this, but a different graphical display would probably be better suited to this kind of data.
data <- as.data.frame(matrix(scan(text="
141.5453333 1
148.7116667 1
154.7373333 1
228.2396667 1
148.4423333 1
131.3893333 1
139.2673333 1
140.5556667 2
143.719 2
214.3326667 2
134.4513333 3
169.309 8
161.1313333 4
"), ncol=2, byrow=TRUE))
names(data) <- c("T", "L")
data$id <- 1:nrow(data)
ggplot(data,aes(x=id, y=T))+geom_line() + xlab("L") +
scale_x_continuous(breaks=data$id, labels=data$L)
You have an error in your code; try this:
ggplot(data,aes(x=L, y=T))+geom_line()
Default arguments for aes are:
aes(x, y, ...)

Using Histogram as input in R

This is admittedly a very simple question that I just can't find an answer to.
In R, I have a file that has 2 columns: 1 of categorical data names, and the second a count column (count for each of the categories). With a small dataset, I would use 'reshape' and the function 'untable' to make 1 column and do analysis that way. The question is, how to handle this with a large data set?
In this case, my data is humungous and that just isn't going to work.
My question is, how do I tell R to use something like the following as distribution data:
Cat Count
A 5
B 7
C 1
That is, I give it a histogram as an input and have R figure out that it means there are 5 of A, 7 of B and 1 of C when calculating other information about the data.
The desired input rather than output would be for R to understand that the data would be the same as follows,
A
A
A
A
A
B
B
B
B
B
B
B
C
In reasonable size data, I can do this on my own, but what do you do when the data is very large?
Edit
The total sum of all the counts is 262,916,849.
In terms of what it would be used for:
This is new data, trying to understand the correlation between this new data and other pieces of data. Need to work on linear regressions and mixed models.
I think what you're asking is to reshape a data frame of categories and counts into a single vector of observations, where categories are repeated. Here's one way:
dat <- data.frame(Cat=LETTERS[1:3],Count=c(5,7,1))
# Cat Count
#1 A 5
#2 B 7
#3 C 1
rep.int(dat$Cat,times=dat$Count)
# [1] A A A A A B B B B B B B C
#Levels: A B C
To follow up on @Blue Magister's excellent answer, here's a 100,000 row histogram with a total count of 551,245,193:
set.seed(42)
Cat <- sapply(rep(10, 100000), function(x) {
paste(sample(LETTERS, x, replace=TRUE), collapse='')
})
dat <- data.frame(Cat, Count=sample(1000:10000, length(Cat), replace=TRUE))
> head(dat)
Cat Count
1 XYHVQNTDRS 5154
2 LSYGMYZDMO 4724
3 XDZYCNKXLV 8691
4 TVKRAVAFXP 2429
5 JLAZLYXQZQ 5704
6 IJKUBTREGN 4635
This is a pretty big dataset by my standards, and the operation Blue Magister describes is very quick:
> system.time(x <- rep(dat$Cat,times=dat$Count))
user system elapsed
4.48 1.95 6.42
It uses about 6GB of RAM to complete the operation.
This really depends on what statistics you are trying to calculate. The xtabs function will create tables for you where you can specify the counts. The Hmisc package has functions like wtd.mean that will take a vector of weights for computing a mean (and related functions for standard deviation, quantiles, etc.). The biglm package could be used to expand parts of the dataset at a time and analyze. There are probably other packages as well that would handle the frequency data, but which is best depends on what question(s) you are trying to answer.
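As a small illustration of working with the counts directly, here is a sketch using only base R. The score column is a hypothetical numeric value per category that I made up for the example; it is not part of the question's data.

```r
dat <- data.frame(Cat = c("A", "B", "C"), Count = c(5, 7, 1))

# Frequency table built from the counts, without expanding to 13 rows
tab <- xtabs(Count ~ Cat, data = dat)
prop <- prop.table(tab)

# Hypothetical numeric value attached to each category
dat$score <- c(1, 2, 3)

# Weighted mean using the counts as frequency weights
wm <- weighted.mean(dat$score, w = dat$Count)

# Same result as expanding first, which is only feasible for small data
wm_expanded <- mean(rep(dat$score, times = dat$Count))

print(tab)
print(prop)
print(wm)
```

The weighted version never allocates a vector longer than the number of categories, so memory use does not grow with the total count.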
The existing answers all expand the pre-binned dataset into a full distribution and then use R's histogram function, which is memory inefficient and will not scale to very large datasets like the one the original poster asked about. The HistogramTools CRAN package includes a PreBinnedHistogram function which takes arguments for breaks and counts to create a histogram object in R without massively expanding the dataset.
For example, if the data set has 3 buckets with 5, 7, and 1 elements, all of the other solutions posted here so far expand that into a list of 13 elements first and then create the histogram. PreBinnedHistogram, in contrast, creates the histogram directly from the 3-element input list without creating a much larger intermediate vector in memory.
big.histogram <- PreBinnedHistogram(my.data$breaks, my.data$counts)
