I have a large dataset (14295 rows, 58 columns). Each column is a different element from the periodic table (e.g. Fe, Ca, Zr) and the rows are arranged according to depth (in mm); the last column is the depth value. I am trying to write code that can be customized to a given group of elements over a given depth interval, but I don't want to have to go through and change a bunch of lines of code every time I look at a different subset. So far I have created a data frame called Section:
Section <- df[50:100,]
and a vector called Elements:
Elements <- c("Fe", "Ca", "Zr")
I can subsample the Section data frame by:
Section %>%
select(., Elements, depth)
but now I want to plot this with ggplot and I can't figure out how to call the Elements vector to the x-variable. I tried:
Section %>%
select(., Elements, depth) %>%
ggplot() +
geom_path (aes(Elements, depth))
but the arguments don't have the same length. How can I plot the selected elements from the Elements vector?
I think your problem is actually that your data is not formatted in the most useful way (wide vs. long), so you aren't actually giving ggplot what you think you are. If you give it a vector as an aesthetic (Elements here), it will try its best to plot it. In this case, it will do it if the length matches by just matching up values in depth to things in Elements. So this works:
library(ggplot2)
# Toy Data
df <- data.frame(O = 1:3,
                 Fe = 2:4,
                 Ca = 3:5,
                 Zr = 4:6,
                 depth = 5:7)
Elements <- c('Fe', 'Ca', 'Zr')
ggplot(df) +
  geom_point(aes(x=Elements, y=depth))
But it just matches the first depth to 'Fe', the second depth to 'Ca', etc. I don't think that's what you are hoping to have happen.
Long vs Wide Data
You have a separate column for each of these elements, but do they actually represent different things? You are probably better off reformatting your data so that all these "element" columns get collapsed into key-value pairs using tidyr:
# Wide:
df
  O Fe Ca Zr depth
1 1  2  3  4     5
2 2  3  4  5     6
3 3  4  5  6     7
# Long
library(tidyr)
longDf <- tidyr::gather(df, element, amount, -depth)
longDf
   depth element amount
1      5       O      1
2      6       O      2
3      7       O      3
4      5      Fe      2
5      6      Fe      3
6      7      Fe      4
7      5      Ca      3
8      6      Ca      4
9      7      Ca      5
10     5      Zr      4
11     6      Zr      5
12     7      Zr      6
Now you can get the elements you want using dplyr's filter (which is also probably a better option for subsetting by depth) and use the new element column as the x coordinate for plotting:
library(dplyr)    # for %>% and filter()
library(ggplot2)
longDf %>%
  filter(element %in% Elements) %>%
  ggplot() +
  geom_path(aes(x=element, y=depth))
I'm not sure what you're expecting the graph to look like, but that should get you started.
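If the goal is one profile per element (amount against depth), the steps above can be wrapped in a small function so that only the element vector and the depth interval change between runs. This is a minimal sketch, not the asker's actual code: the function name plot_elements and its depth_range argument are made up for illustration, and putting amount on the x-axis with depth reversed on the y-axis is an assumption about what the plot should look like.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data in the same wide layout as above
df <- data.frame(O = 1:3, Fe = 2:4, Ca = 3:5, Zr = 4:6, depth = 5:7)
Elements <- c("Fe", "Ca", "Zr")

# Hypothetical helper: subset by depth and element, reshape to long, then plot
plot_elements <- function(data, elements, depth_range) {
  data %>%
    filter(depth >= depth_range[1], depth <= depth_range[2]) %>%
    select(all_of(elements), depth) %>%
    gather(element, amount, -depth) %>%
    ggplot(aes(x = amount, y = depth, colour = element)) +
    geom_path() +
    scale_y_reverse()   # assumption: depth should increase downwards
}

p <- plot_elements(df, Elements, c(5, 7))
print(p)
```

To look at a different subset, only the arguments change, e.g. plot_elements(df, c("O", "Fe"), c(5, 6)).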
For example:
A data frame "housing" has a column "street" with different street names as levels.
I want to return a data frame with the number of houses on each street (level), i.e. the number of times each level is repeated.
Which functions do I use in R?
This should help:
library(dplyr)
housing %>% group_by(street) %>% summarise(Count=n())
This can be done in multiple ways, for instance with base R using table():
table(housing$street)
It can also be done through dplyr, as illustrated by Duck.
Another option (my preference) is using data.table.
library(data.table)
setDT(housing)
housing[, .N, by = street]
summary() on a factor reports the frequencies of at most 100 levels by default (its maxsum argument) and lumps the rest into "(Other)". If there are more levels, try:
table(housing$street)
For example, let's generate one hundred one-letter street names and summarise them with table.
set.seed(1234)
housing <- data.frame(street = sample(letters, size = 100, replace = TRUE))
x <- table(housing$street)
x
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# 1 3 5 6 4 6 2 6 5 3 1 3 1 2 5 5 4 1 5 5 3 7 4 5 3 5
As per OP's comment: to use the result in further analyses, it needs to be assigned to a variable, here x. The class of the variable is table, and in base R it works with most functions like a named vector. For example, to find the most frequent street name, use which.max.
which.max(x)
# v
# 22
The result says that the 22nd position in x has the maximum value and it is called v.
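Since the question asks for a data frame rather than a table object, one possible follow-up is to convert the table with as.data.frame(). The sketch below reuses the simulated housing data from above; the column name Count is just a choice made here via the responseName argument.

```r
set.seed(1234)
housing <- data.frame(street = sample(letters, size = 100, replace = TRUE))

# Naming the argument to table() carries "street" through as the column name
counts <- as.data.frame(table(street = housing$street), responseName = "Count")

# Most frequent streets first
head(counts[order(-counts$Count), ])
```

The result has one row per street, so it can be merged back onto other per-street data if needed.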
I have a very large data frame and a set of adjustment coefficients that I wish to apply to certain years, with each coefficient applied to one and only one year. The code below tries, for each row, to select the right coefficient, and return a vector containing dat in the unaffected years and dat times that coefficient in the selected years, which is to replace dat.
year <- rep(1:5, times = c(2,2,2,2,2))
dat <- 1:10
df <- tibble(year, dat)
adjust = c(rep(0, 4), rep(c(1 + 0.1*1:3), c(2,2,2)))
df %>% mutate(dat = ifelse(year < 5, year, dat*adjust[[year - 2]]))
When I run this, I get the following error:
Evaluation error: attempt to select more than one element in vectorIndex.
I am pretty sure this is because the extraction operator [[ treats year as the entire vector year rather than the year of the current row, so there is then a vectorized subtraction, whereupon [[ chokes on the vector-valued index.
I know there are many ways to solve this problem. I have a particularly ugly way involving nested ifelse’s working now. My question is, is there any way to do what I was trying to do in an R- and dplyr- idiomatic way? In some ways this seems like a filter or group_by problem, since we want to treat rows or groups of rows as distinct entities, but I have not found a way of doing so that is any cleaner.
It seems like there are some functions which are easier to define or to think of as row-by-row rather than as the product of entire vectors. I could produce a single vector containing the correct adjustment for each year, but since the number of rows per year varies, I would still have to apply a multi-valued conditional test to construct that vector, so the same problem arises.
Or doesn’t it?
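You need to use [ instead of [[ for vector indexing: [[ selects exactly one element, whereas [ accepts a vector of indices. In addition, year - 2 produces negative and zero indices for years 1 and 2, which causes further problems. If you want to map year to adjust by index position, you can use replace with a mask that indicates the years to be modified (the output below appears to correspond to adjust <- 1 + 0.1*1:3, i.e. one coefficient per adjusted year):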
df %>%
mutate(dat = {
mask = year > 2;
replace(year, mask, dat[mask] * adjust[year[mask] - 2])
})
# A tibble: 10 x 2
# year1 dat1
# <int> <dbl>
# 1 1 1.0
# 2 1 1.0
# 3 2 2.0
# 4 2 2.0
# 5 3 5.5
# 6 3 6.6
# 7 4 8.4
# 8 4 9.6
# 9 5 11.7
#10 5 13.0
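Another way to avoid the index arithmetic is to key the coefficients by year and look them up by name, which works however many rows each year has. This is only a sketch under two assumptions that are not in the question: there is one coefficient per adjusted year (1.1, 1.2, 1.3 for years 3 to 5), and unaffected years should keep dat unchanged, so the first four rows differ from the output above.

```r
year <- rep(1:5, each = 2)
dat  <- 1:10

# Assumed coefficients, named by the year they apply to
adjust <- c(`3` = 1.1, `4` = 1.2, `5` = 1.3)

# Look up one factor per row; years without a coefficient come back as NA
factor_by_row <- unname(adjust[as.character(year)])
factor_by_row[is.na(factor_by_row)] <- 1   # leave those years unchanged

adjusted <- dat * factor_by_row
adjusted
```

Inside a dplyr pipeline the same lookup could sit in a single mutate() call, since the whole expression is vectorised over the year column.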
I have data saved in a text file with a couple of thousand lines. Each line has only one value, like this:
52312
2
3
4
5
7
9
4
5
3
The first value is always roughly 10,000 times bigger than all the other values.
I can read in the data with data<-read.table("data.txt")
When I just use plot(data) all the data have the same y-value, resulting in a line, where the x values just represent the values given from the data.
What I want, however, is for the x-value to represent the line number and the y-value the actual data. So for the above example my values would be (1,52312), (2,2), (3,3), (4,4), (5,5), (6,7), (7,9), (8,4), (9,5), (10,3).
Also, since the first value is way higher than all the other values, I'd like to use a log scale for the y-axis.
Sorry, very new to R.
set.seed(1000)
df = data.frame(a=c(9999999,sample(2:78,77,replace = F)))
plot(x=1:nrow(df), y=log(df$a))
i) set.seed(1000) makes sample() return the same random numbers each time you run this code, so the example is reproducible.
ii) Type ?sample in the R console for its documentation.
iii) Since you wanted the x-axis to be the line number, I create it with the ":" operator: 1:3 gives 1, 2, 3. Similarly, 1:nrow(df) creates an index whose length matches the number of rows in your data.
iv) For the log scale, simply wrap the values in log(). Alternatively, plot(..., log = "y") keeps the original values and draws the y-axis on a logarithmic scale. Read more in ?plot and its parameters.
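Applied to a file like the one in the question, the same idea might look as follows. This is a sketch: the temporary file stands in for data.txt, and V1 is the column name read.table assigns by default when the file has no header.

```r
# Hypothetical stand-in for data.txt: one value per line
path <- tempfile(fileext = ".txt")
writeLines(c("52312", "2", "3", "4", "5", "7", "9", "4", "5", "3"), path)

data   <- read.table(path)   # one column, named V1 by default
values <- data$V1            # pull the column out as a plain numeric vector

# x = line number, y = value; log = "y" puts the y-axis on a log scale
plot(seq_along(values), values, log = "y",
     xlab = "line number", ylab = "value")
```

Passing the vector rather than the whole data frame is what makes plot() use the position in the file as the x coordinate.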
Try this:
df
x y
1 1 52312
2 2 2
3 3 3
4 4 4
5 5 5
6 6 7
7 7 9
8 8 4
9 9 5
10 10 3
library(ggplot2)
ggplot(df, aes(x, y)) + geom_point(size=2) + scale_y_log10()
I am a novice R user, hence the question. I refer to the solution for creating stacked bar plots in "R programming: creating a stacked bar graph, with variable colors for each stacked bar".
My issue is slightly different. I have data in 4 columns. The last column is the sum of the first 3 columns. I want to plot a bar chart showing 1) the summed total (i.e. the 4th column) as the height of each bar, and 2) each bar split by the relative contributions of the three columns.
I was hoping someone could help.
Regards,
Bernard
If I understood correctly, this may do the trick.
The following code works well for this example df data frame:
df <- read.table(header = TRUE, text = "
a b c sum
1 9 8 18
3 6 2 11
1 5 4 10
23 4 5 32
5 12 3 20
2 24 1 27
1 2 4 7
")
As you don't want to plot a count of observations but the actual values in your data frame, you need to use geom_bar(stat = "identity") in ggplot2. Some data manipulation is necessary too. You also don't need the sum column: stacking the three parts gives a bar whose height is their total.
library(reshape2)  # provides melt()
library(ggplot2)
df <- df[, -ncol(df)]              # drop the last column (assumed to be the sum)
df$event <- seq_len(nrow(df))      # row index, so values from the same row share one bar
df <- melt(df, id.vars = "event")  # reshape to long format so ggplot can read it
px <- ggplot(df, aes(x = event, y = value, fill = variable)) +
  geom_bar(stat = "identity")
print(px)
This code generates a stacked bar chart: one bar per event (row), with each bar split by variable.
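For comparison, a stacked chart of the same data can also be sketched in base R with barplot(), which stacks the rows of a matrix within each bar. The object names wide and parts are made up here; the last line checks that the stacked heights equal the precomputed sum column, which is why that column is not needed for plotting.

```r
wide <- read.table(header = TRUE, text = "
a b c sum
1 9 8 18
3 6 2 11
1 5 4 10
23 4 5 32
5 12 3 20
2 24 1 27
1 2 4 7
")

# barplot() wants one column per bar, so transpose: rows a, b, c; columns = events
parts <- t(as.matrix(wide[, c("a", "b", "c")]))

barplot(parts, names.arg = seq_len(ncol(parts)), legend.text = TRUE,
        xlab = "event", ylab = "value")

# Each stacked bar is as tall as the corresponding entry in the sum column
all(colSums(parts) == wide$sum)
```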
I have some data that I want to display as a box plot using ggplot2. It's basically counts, stratified by two other variables. Here's an example of the data (in reality there's a lot more, but the structure is the same):
TAG Count Condition
A 5 1
A 6 1
A 6 1
A 6 2
A 7 2
A 7 2
B 1 1
B 2 1
B 2 1
B 12 2
B 8 2
B 10 2
C 10 1
C 12 1
C 13 1
C 7 2
C 6 2
C 10 2
For each Tag, there are a fixed number of observations in condition 1, and condition 2 (here it's 3, but in the real data it's much more). I want a box plot like the following ('s' is a dataframe arranged as above):
ggplot(s, aes(x=TAG, y=Count, fill=factor(Condition))) + geom_boxplot()
This is fine, but I want to be able to order the x-axis by the p-value from a Wilcoxon test for each Tag. For example, with the above data, the values would be (for the tags A,B, and C respectively):
> wilcox.test(c(5,6,6),c(6,7,7))$p.value
[1] 0.1572992
> wilcox.test(c(1,2,2),c(12,8,10))$p.value
[1] 0.0765225
> wilcox.test(c(10,12,13),c(7,6,10))$p.value
[1] 0.1211833
Which would induce the ordering A,C,B on the x-axis (largest to smallest). But I don't know how to go about adding this information into my data (specifically, attaching a p-value at just the tag level, rather than adding a whole extra column), or how to use it to change the x-axis order. Any help greatly appreciated.
Here is a way to do it. The first step is to calculate the p-value for each TAG. We do this by using ddply, which splits the data by TAG and calculates the p-value using the formula interface to wilcox.test (dfr below is the data frame the question calls s). The plot statement reorders TAG based on its p-value.
library(ggplot2); library(plyr)
dfr2 <- ddply(dfr, .(TAG), transform,
pval = wilcox.test(Count ~ Condition)$p.value)
qplot(reorder(TAG, pval), Count, fill = factor(Condition), geom = 'boxplot',
data = dfr2)
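Note that reorder() sorts levels by increasing value, so the plot above runs from the smallest p-value to the largest. If the largest should come first, as the question asks, reorder(TAG, -pval) is one option. Another is to keep the p-values in a separate named vector (one value per tag, no extra column) and set the factor levels from it. The sketch below does that in base R on the example data; suppressWarnings() only hides the tie warnings that wilcox.test gives for these counts.

```r
s <- data.frame(
  TAG       = rep(c("A", "B", "C"), each = 6),
  Count     = c(5, 6, 6, 6, 7, 7,  1, 2, 2, 12, 8, 10,  10, 12, 13, 7, 6, 10),
  Condition = rep(rep(1:2, each = 3), times = 3)
)

# One p-value per tag, stored as a named vector rather than a column
pvals <- sapply(split(s, s$TAG), function(d) {
  suppressWarnings(wilcox.test(Count ~ Condition, data = d)$p.value)
})

# Order the factor levels from largest to smallest p-value
s$TAG <- factor(s$TAG, levels = names(sort(pvals, decreasing = TRUE)))
levels(s$TAG)
```

With the p-values quoted in the question this should give the order A, C, B, and the original ggplot call can then be used unchanged because ggplot follows the factor's level order.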