Split data based on column values and create scatter plot. - r

I need to make a scatter plot for days vs age for the f group (sex=1) and make another scatter plot for days vs age for the m group (sex=2) using R.
days age sex
306 74 1
455 67 2
1000 55 1
505 65 1
399 54 2
495 66 2
...
How do I extract the data by sex? I know that after that I can use the plot() function to create a scatter plot.
Thank you!

You can do this with the traditional R graphics functions like:
plot(age ~ days, Data[Data$sex == 1, ])
plot(age ~ days, Data[Data$sex == 2, ])
If you prefer to color the points rather than separate the plots (which might be easier to understand) you can do:
plot(age ~ days, Data, col=Data$sex)
However, this kind of plot would be especially easy (and better looking) using ggplot2:
library(ggplot2)
ggplot(Data, aes(x=days, y=age)) + geom_point() + facet_wrap(~sex)
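To address the "how do I extract the data by sex" part directly, here is a minimal base R sketch. It assumes the six rows shown in the question are stored in a data frame named Data (the name used in the answer above). The names f and m and the axis choice (age on x, days on y) are illustrative assumptions, not something the question fixes.

```r
# Sample rows copied from the question; the real data has more rows.
Data <- data.frame(
  days = c(306, 455, 1000, 505, 399, 495),
  age  = c(74, 67, 55, 65, 54, 66),
  sex  = c(1, 2, 1, 1, 2, 2)
)

# Extract each group with a logical row filter.
f <- Data[Data$sex == 1, ]
m <- Data[Data$sex == 2, ]

# One scatter plot per group, drawn side by side.
op <- par(mfrow = c(1, 2))
plot(f$age, f$days, xlab = "age", ylab = "days", main = "f (sex = 1)")
plot(m$age, m$days, xlab = "age", ylab = "days", main = "m (sex = 2)")
par(op)  # restore the previous layout
```

split(Data, Data$sex) would return both subsets at once as a list, which may be handier when there are more than two groups.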

spread() from the tidyr package splits data by column values. This is also called converting data from "long" to "wide".
I haven't tested this, but something like
spread(data, sex, age)
should get you
days  1  2
 306 74 NA
 455 NA 67
1000 55 NA
 505 65 NA
 399 NA 54
 495 NA 66
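Since the answer above says it was untested, here is a small sketch of the same reshaping on the question's sample rows. It uses tidyr::pivot_wider(), which tidyr offers as the replacement for the superseded spread(). The data frame name Data is an assumption.

```r
library(tidyr)

# Sample rows copied from the question.
Data <- data.frame(
  days = c(306, 455, 1000, 505, 399, 495),
  age  = c(74, 67, 55, 65, 54, 66),
  sex  = c(1, 2, 1, 1, 2, 2)
)

# One new column per value of sex, filled with age.
# Rows that have no value for a given sex get NA in that column.
wide <- pivot_wider(Data, names_from = sex, values_from = age)
print(wide)
```

The new columns are named after the values of sex, so they need backticks when accessed by name, for example wide$`1`.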

Related

Plotting each value of columns for a specific row

I am struggling to plot a specific row from a data frame. Below is the graph I am trying to plot. I have tried using ggplot and the normal plot function, but I cannot figure it out.
  Wt2 Wt3 Wt4 Wt5 Lngth2 Lngth3 Lngth4 Lngth5
1  48  59  95  82    141    157    168    183
2  59  68 102 102    140    168    174    170
3  61  77  93 107    145    162    172    177
4  54  43 104 104    146    159    176    171
5 100 145 185 247    150    158    168    175
6  68  82  95 118    142    140    178    189
7  68  95 109 111    139    171    176    175
Above is the data frame I am trying to plot. Each row holds one bear's measurements, so row 1 is bear 1. How would I plot only the Wt columns for bear 1 against an x-axis that goes from year 2 to year 5?
You can pivot your data frame into a longer format:
First add a column with the row number (bear number):
df = cbind("Bear"=as.factor(1:nrow(df)), df)
It needs to be factor so we can pass it as a group variable to ggplot. Now pivot:
df2 = tidyr::pivot_longer(df[,1:5], cols=2:5,
names_to="Year", values_to="Weight", names_prefix="Wt")
df2$Year = as.numeric(df2$Year)
We drop the Length columns with df[,1:5]; cols=2:5 says that we only want to pivot the weight columns; names_to and values_to give the names of the columns we want to create; and lastly names_prefix="Wt" removes the "Wt" from the column names, leaving only the year number. That leaves a character column, so we convert it to numeric with as.numeric().
Then plot:
ggplot(df2, aes(x=Year, y=Weight, linetype=Bear)) + geom_line()
Output (PS: I created my own data, so the actual numbers are off):
Just an addition: if you don't want to specify the columns of your dataset explicitly, you can do:
df2 = df[, grep("Wt|Bear", colnames(df))]
df2 = tidyr::pivot_longer(df2, cols=grep("Wt", colnames(df2)),
                          names_to="Year", values_to="Weight", names_prefix="Wt")
Edit: one plot for each group
You can use facet_wrap:
ggplot(df2, aes(x=Year, y=Weight, linetype=Bear)) +
facet_wrap(~Bear, nrow=2, ncol=4) +
geom_line()
Output:
You can change nrow and ncol as you wish, and you can remove linetype from aes() since the facets already differentiate the bears, but that's not mandatory.
You can also change the levels of the categorical variable to improve the label on each panel, for example with levels(df2$Bear) = paste("Bear", 1:7) (or do that when creating the column).
Try
ggplot(mapping = aes(x = seq.int(2, 5), y = c(48, 59, 95, 82))) +
geom_point(color = "blue") +
geom_line(color = "blue") +
xlab("Year") +
ylab("Weight")
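The hard-coded values in the last answer can also be pulled straight from the data frame. A minimal base R sketch, assuming the data frame from the question is named df (only its first three rows are reproduced here to keep the example short); wt_cols, years and bear1 are illustrative names:

```r
# First three rows of the question's data frame.
df <- data.frame(
  Wt2 = c(48, 59, 61), Wt3 = c(59, 68, 77),
  Wt4 = c(95, 102, 93), Wt5 = c(82, 102, 107),
  Lngth2 = c(141, 140, 145), Lngth3 = c(157, 168, 162),
  Lngth4 = c(168, 174, 172), Lngth5 = c(183, 170, 177)
)

wt_cols <- grep("^Wt", names(df), value = TRUE)  # names of the weight columns
years   <- as.numeric(sub("^Wt", "", wt_cols))   # year number taken from each name
bear1   <- unlist(df[1, wt_cols])                # weights in row 1 as a plain vector

# Points joined by lines, one point per year.
plot(years, bear1, type = "b", xlab = "Year", ylab = "Weight", main = "Bear 1")
```

Changing the row index in df[1, wt_cols] selects a different bear.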

Table in r to be weighted

I'm trying to run a crosstab/contingency table, but need it weighted by a weighting variable.
Here is some sample data.
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(c(1:10), 100, replace = TRUE)
df <- data.frame(age,sex, wgt)
I've run this to get a regular crosstab table
table(df$sex, df$age)
To get a weighted frequency, I tried the Hmisc package (if you know a better package, let me know):
library(Hmisc)
wtd.table(df$sex, df$age, weights=df$wgt)
Error in match.arg(type) : 'arg' must be of length 1
I'm not sure where I've gone wrong, but it doesn't run, so any help will be great.
Alternatively, if you know how to do this in another package, which may be better for analysing survey data, that would be great too. Many thanks in advance.
Try this
GDAtools::wtable(df$sex, df$age, w = df$wgt)
Output
       0-15 16-29 30-44 45+ NA tot
Female   56    73    60  76  0 265
Male     76    99   106  90  0 371
NA        0     0     0   0  0   0
tot     132   172   166 166  0 636
Update
In case you do not want to install the whole package, here are two essential functions you need:
wtable and dichotom
Source them and you should be able to use wtable without any problem.
A solution is to repeat the rows of the data.frame by weight and then table the result.
The following repeats the data.frame's rows (only relevant columns):
df[rep(row.names(df), df$wgt), 1:2]
And it can be used to get the contingency table.
table(df[rep(row.names(df), df$wgt), 1:2])
#        sex
#age     Female Male
#  0-15      56   76
#  16-29     73   99
#  30-44     60  106
#  45+       76   90
Base R, in stats, has xtabs for exactly this:
xtabs(wgt ~ age + sex, data=df)
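As a quick sanity check, the xtabs() result can be compared with the row-repetition approach from the previous answer. This is a sketch using the question's sample data; the names weighted and expanded are illustrative, and the exact counts depend on the R version's random number generator, so none are asserted here.

```r
# Sample data from the question.
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(c(1:10), 100, replace = TRUE)
df  <- data.frame(age, sex, wgt)

# Weighted crosstab: the left-hand side of the formula is summed within each cell.
weighted <- xtabs(wgt ~ age + sex, data = df)

# Same table built by repeating each row wgt times and counting rows.
expanded <- table(df[rep(seq_len(nrow(df)), df$wgt), c("age", "sex")])

all(weighted == expanded)      # the two approaches agree cell by cell
sum(weighted) == sum(df$wgt)   # the grand total equals the sum of the weights
```

xtabs() avoids materialising the expanded data frame, which matters when the weights are large, and it also accepts non-integer weights, which rep() would truncate.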
A tidyverse solution using your data and the same set.seed. uncount is the equivalent of @Rui's rep of the weights.
library(dplyr)
library(tidyr)
df %>%
uncount(weights = .$wgt) %>%
select(-wgt) %>%
table
#>        sex
#> age     Female Male
#>   0-15      56   76
#>   16-29     73   99
#>   30-44     60  106
#>   45+       76   90

How do I plot from data frames?

From the following code, I got a data frame in R. I am trying to plot the data frame; however, I am only interested in the score they got on the Final. So I want the x-axis to be the student number, running from 1 to 6 since that's how many data points there are, and I want the y-axis to be Final. Is there a way to do this from just the data frame?
data <- data.frame(Score1=c(100,36,58,77,99,92),Score2=c(56,68,68,98,15,35), Final=c(63,87,89,45,99,18))
Output listed below:
Score1 Score2 Final
1 100 56 63
2 36 68 87
3 58 68 89
4 77 98 45
5 99 15 99
6 92 35 18
Or will I have to do something like this instead? But this gives me an error that the lengths are not the same.
data <- data.frame(Score1=c(100,36,58,77,99,92),Score2=c(56,68,68,98,15,35))
Final=c(63,87,89,45,99,18)
f.data <- cbind(data,Final)
b <- 6
plot(b,Final)
Use the following
library(ggplot2);
qplot( x = 1:6, y = data$Final)
The code below can do the trick.
plot(data$Final)
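The error in the question's second attempt comes from plot(b, Final): b is the single number 6, while Final has six values, and plot() needs x and y of equal length. A minimal sketch with an explicit x index, where student is an illustrative name:

```r
data <- data.frame(
  Score1 = c(100, 36, 58, 77, 99, 92),
  Score2 = c(56, 68, 68, 98, 15, 35),
  Final  = c(63, 87, 89, 45, 99, 18)
)

# One x value per row: 1, 2, ..., 6.
student <- seq_len(nrow(data))

plot(student, data$Final, xlab = "Student", ylab = "Final")
```

This draws the same points as plot(data$Final), which uses the row index as x when only one vector is given, but it lets you label the axes explicitly.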

Violin plot from summary data

I'd like to use a violin plot to visualise the number of archaeological artefacts by site (A and B) and by century with data in the following format (years are Before Present):
Year SiteA SiteB
22400 356 182
22500 234 124
22600 144 231
22700 12 0
...
24800 112 32
There are some 6000 artefacts in total. In ggplot2, it would seem as if the preferred data entry format is of one line per observation (artefact) for a violin plot:
Site Year
A 22400
A 22400
... (356 times)
A 22400
B 22400
B 22400
... (182 times)
A 22500
A 22500
... (234 times)
A 22500
... ... ... (~5000 lines)
B 24800
B 24800
... (32 times)
B 24800
Is there an effective way of converting summary dataframe (1st grey box) into an observation-by-observation dataframe (2nd grey box) for use in a violin plot?
Alternatively, is there a way of making violin plots from data formatted as in the first grey box?
Update:
With the answer provided by eipi10, if either Site A or B has zero artefacts (as in the updated example above for the year 22,700), I get the following error:
Error in data.frame(Year = rep(dat$Year[i], dat$value[i]), Site = dat$key[i]) :
arguments imply differing number of rows: 0, 1
The plot would look like this:
How about this:
library(tidyverse)
dat = read.table(text="Year SiteA SiteB
22400 356 182
22500 234 124
22600 144 231
24800 112 32", header=TRUE, stringsAsFactors=FALSE)
dat = gather(dat, key, value, -Year)
dat.long = data.frame(Year = rep(dat$Year, dat$value), Site=rep(dat$key, dat$value))
ggplot(dat.long, aes(Site, Year)) +
geom_violin()
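The vectorised rep() call above should also cope with the zero-count row from the question's update (year 22700, Site B), because rep() emits nothing for a count of 0. Here is a base R sketch of the same expansion without gather(), using only the five rows actually shown in the question (the elided rows are not filled in); counts, site and year are illustrative names.

```r
dat <- data.frame(
  Year  = c(22400, 22500, 22600, 22700, 24800),
  SiteA = c(356, 234, 144, 12, 112),
  SiteB = c(182, 124, 231, 0, 32)
)

# Stack the two count columns, keeping a matching Site and Year for each count.
counts <- c(dat$SiteA, dat$SiteB)
site   <- rep(c("A", "B"), each = nrow(dat))
year   <- rep(dat$Year, times = 2)

# Repeat each (Site, Year) pair once per artefact; a count of 0 yields no rows.
dat.long <- data.frame(
  Site = rep(site, times = counts),
  Year = rep(year, times = counts)
)

nrow(dat.long) == sum(counts)  # one row per artefact
```

dat.long can then be passed to ggplot(dat.long, aes(Site, Year)) + geom_violin() as in the answer.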

Barplots for all levels (even those with no values) in R

I'm graphing values associated with a range of factors I got from cut:
         a   b
1  (25,30]  10
2  (30,35] 313
3  (35,40] 904
4  (40,45] 809
5  (45,50] 608
6  (50,55] 514
7  (55,60] 227
8  (60,65] 323
9  (65,70]  23
10 (70,75]   5
11 (75,80]   1
I graph it with:
plt_tmp = barplot(agg$b)
axis(1, agg$a, at=plt_tmp,las=2)
This would be fine, but the levels are actually (ie levels(agg$a)):
[1] "(0,5]" "(5,10]" "(10,15]" "(15,20]" "(20,25]" "(25,30]" "(30,35]" "(35,40]" "(40,45]"
[10] "(45,50]" "(50,55]" "(55,60]" "(60,65]" "(65,70]" "(70,75]" "(75,80]" "(80,85]" "(85,90]"
[19] "(90,95]" "(95,100]"
And I was hoping I could graph all of them, including the missing ones, as 0 values. How would I go about doing this? Any help would be very much appreciated!
You need to merge all your levels back against your base data, and then pass that data to barplot. With a simplified example:
agg <- data.frame(a=factor(c(1,2,4), levels=1:5), b=c(10,1,20))
with(merge(agg, list(a=levels(agg$a)), all=TRUE), barplot(b, names.arg=a) )
Use the ggplot2 package for your bar chart. With the drop argument of the scale you can define whether unused levels are plotted or not:
library(ggplot2)
ggplot(agg, aes(x=a, y=b)) +
geom_bar(stat="identity") +
scale_x_discrete(drop=FALSE)
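A base R alternative is to build a zero-filled vector named by the factor levels and then overwrite the entries that have data. This is a sketch on a made-up three-row agg with interval labels (not the question's data); full is an illustrative name. Because the vector is created from levels(agg$a), the bars stay in level order, which merge() may not preserve since it sorts on the key by default.

```r
# Toy data: three observed intervals out of five possible levels.
lv  <- c("(0,5]", "(5,10]", "(10,15]", "(15,20]", "(20,25]")
agg <- data.frame(
  a = factor(c("(5,10]", "(10,15]", "(20,25]"), levels = lv),
  b = c(10, 1, 20)
)

# Start with 0 for every level, then fill in the observed values by name.
full <- setNames(numeric(nlevels(agg$a)), levels(agg$a))
full[as.character(agg$a)] <- agg$b

# names(full) supplies the bar labels; las = 2 turns them vertical.
barplot(full, las = 2)
```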
