May I ask for help with R coding? I would like to calculate the simple mean and standard deviation as shown below. This is an example of calculating them for species richness. However, I would like to calculate them for several species, each in its own column (species1, species2, species3, species4, etc.).
How can I do this automatically in R (I guess using some loop or function) and get a nice overview table where the calculations for each species appear one below the other?
mean1 <- tapply(edge2$species_richness, list(Management = edge2$Management), mean)
sd1 <- tapply(edge2$species_richness, list(Management = edge2$Management), sd)
cbind(mean1, sd1)
Result for species richness:
mean1 sd1
AES 15.6250 5.875089
AES2 29.5000 9.570789
Control 6.9375 8.590450
Centre 16.3125 5.437141
Using aggregate on mtcars (the grouping variable varies over the rows, as in your data, and the aggregated variables are in the columns):
> aggregate(. ~ am, data = mtcars, function(x) c("mean" = mean(x), "sd" = sd(x)))
am mpg.mean mpg.sd cyl.mean cyl.sd disp.mean disp.sd hp.mean hp.sd drat.mean drat.sd wt.mean
1 0 17.1 3.8 6.9 1.5 290.4 110.2 160.3 53.9 3.3 0.4 3.8
2 1 24.4 6.2 5.1 1.6 143.5 87.2 126.8 84.1 4.0 0.4 2.4
wt.sd qsec.mean qsec.sd vs.mean vs.sd gear.mean gear.sd carb.mean carb.sd
1 0.8 18.2 1.8 0.4 0.5 3.2 0.4 2.7 1.1
2 0.6 17.4 1.8 0.5 0.5 4.4 0.5 2.9 2.2
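Applied to your own data, a minimal sketch along the same lines (assuming, from your description, that edge2 has a Management column and species columns named species1 to species4; adjust the names to match your table):
aggregate(cbind(species1, species2, species3, species4) ~ Management,
          data = edge2,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
This gives one row per Management level with a mean and sd pair for every species column, analogous to the mtcars output above.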
Let me clarify: I am not looking for mutate_at or mutate(across(..., ...)) type syntax here. I just want to know how to create several new columns at once inside tidyverse pipe syntax.
Let us assume the case of the iris dataset.
I want to create, say, 10 (or 100 or more) new columns following a pattern like this:
the first new column (variable), say V1, is just Petal.Length * 1,
the second new column, say V2, is Petal.Length * 2,
and so on, up to, say, V10 = Petal.Length * 10,
without explicitly writing the name and formula for each of these columns, which would be cumbersome if I wanted to create, say, 100 new columns.
You can use map functions:
library(dplyr)
library(purrr)
df <- iris %>% head
value <- 1:5
bind_cols(df,
          map_dfc(value, ~ df %>% transmute(!!paste0('col', .x) := Petal.Length * .x)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species col1 col2 col3 col4 col5
#1 5.1 3.5 1.4 0.2 setosa 1.4 2.8 4.2 5.6 7.0
#2 4.9 3.0 1.4 0.2 setosa 1.4 2.8 4.2 5.6 7.0
#3 4.7 3.2 1.3 0.2 setosa 1.3 2.6 3.9 5.2 6.5
#4 4.6 3.1 1.5 0.2 setosa 1.5 3.0 4.5 6.0 7.5
#5 5.0 3.6 1.4 0.2 setosa 1.4 2.8 4.2 5.6 7.0
#6 5.4 3.9 1.7 0.4 setosa 1.7 3.4 5.1 6.8 8.5
In base R, this can be done with lapply:
df[paste0('col', value)] <- lapply(value, `*`, df$Petal.Length)
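The same line scales to the 10 (or 100) columns you mention; only the value vector changes:
value <- 1:10  # or 1:100
df[paste0('col', value)] <- lapply(value, `*`, df$Petal.Length)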
I am trying to find data where three out of four columns are duplicated, and then to remove the duplicates but keep the row with the largest number among the otherwise identical rows.
I found a very helpful article on Stack Overflow which I think gets me about halfway there.
I will base my question on the example in that question. (The example has more columns than what I am working on, but I don't think that really matters.)
require(tidyverse)
x = iris %>% select(-Petal.Width)
dups = x[x %>% duplicated(), ]
answer = iris %>% semi_join(dups)
> answer
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.1 1.5 0.1 setosa
3 4.8 3.0 1.4 0.1 setosa
4 5.1 3.5 1.4 0.3 setosa
5 4.9 3.1 1.5 0.2 setosa
6 4.8 3.0 1.4 0.3 setosa
7 5.8 2.7 5.1 1.9 virginica
8 6.7 3.3 5.7 2.1 virginica
9 6.4 2.8 5.6 2.1 virginica
10 6.4 2.8 5.6 2.2 virginica
11 5.8 2.7 5.1 1.9 virginica
12 6.7 3.3 5.7 2.5 virginica
That article introduced me to code that identifies all rows where everything is equal except Petal.Width:
iris[duplicated(iris[,-4]) | duplicated(iris[,-4], fromLast = TRUE),]
This is great, but I don't know how to progress from here. I would like rows 2 and 5 to collapse into a single row equal to row 5. Similarly, 9 & 10 should become just 10, and 8 & 12 just 12.
The data set I have contains more than 2 rows in some sets of duplicates, so I haven't had any luck using arrange functions to order them and delete the smaller rows.
This should do what you want:
iris %>%
  group_by(Sepal.Length,
           Sepal.Width,
           Petal.Length,
           Species) %>%
  filter(Petal.Width == max(Petal.Width)) %>%
  filter(row_number() == 1) %>%
  ungroup()
The second filter gets rid of duplicates in case Petal.Width is also identical for two entries. Does this work for you?
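If you are on dplyr >= 1.0.0, slice_max is a one-step alternative (a sketch of the same logic):
iris %>%
  group_by(Sepal.Length, Sepal.Width, Petal.Length, Species) %>%
  slice_max(Petal.Width, n = 1, with_ties = FALSE) %>%
  ungroup()
Here with_ties = FALSE plays the role of the second filter, keeping a single row when Petal.Width is tied within a group.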
I am following an SVD example, but I still don't know how to reduce the dimension of the final matrix:
a <- round(runif(10)*100)
dat <- as.matrix(iris[a,-5])
rownames(dat) <- c(1:10)
s <- svd(dat)
pc.use <- 1
recon <- s$u[,pc.use] %*% diag(s$d[pc.use], length(pc.use), length(pc.use)) %*% t(s$v[,pc.use])
But recon still has the same dimensions. I need this for semantic analysis.
The code you provided does not reduce the dimensionality. Instead, it takes the first principal component of your data, removes the remaining principal components, and then reconstructs the data from that one PC alone.
You can check that this is happening by inspecting the rank of the final matrix:
library(Matrix)
as.numeric(rankMatrix(dat))
# [1] 4
as.numeric(rankMatrix(recon))
# [1] 1
If you want to reduce dimensionality, you can select some of the principal components and compute the scores of your data on those components instead.
But first, let's make some things clear about your data: it seems you have 10 samples (rows) with 4 features (columns). Dimensionality reduction will reduce the 4 features to a smaller set of features.
So you can start by transposing your matrix for svd():
dat <- t(dat)
dat
1 2 3 4 5 6 7 8 9 10
Sepal.Length 6.7 6.1 5.8 5.1 6.1 5.1 4.8 5.2 6.1 5.7
Sepal.Width 3.1 2.8 4.0 3.8 3.0 3.7 3.0 4.1 2.8 3.8
Petal.Length 4.4 4.0 1.2 1.5 4.6 1.5 1.4 1.5 4.7 1.7
Petal.Width 1.4 1.3 0.2 0.3 1.4 0.4 0.1 0.1 1.2 0.3
Now you can repeat the svd. Centering the data before this procedure is advisable:
s <- svd(dat - rowMeans(dat))
The principal component scores can be obtained by projecting your data onto the left singular vectors:
PCs <- t(s$u) %*% dat
Now if you want to reduce dimensionality by eliminating PCs with low variance you can do so like this:
dat2 <- PCs[1:2,] # would select first two PCs.
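To decide how many components to keep, you can check how much variance each one explains; the squared singular values are proportional to the variance per component (a quick sketch using the s computed above):
cumsum(s$d^2) / sum(s$d^2)  # cumulative proportion of variance explained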
I've got a data frame all that looks like this:
http://pastebin.com/Xc1HEYyH
Now I want to create a scatter plot with the column headings in the x-axis and the respective values as the data points. For example:
7| x
6| x x
5| x x x x
4| x x x
3| x x
2| x x
1|
---------------------------------------
STM STM STM PIC PIC PIC
cold normal hot cold normal hot
This should be easy, but I cannot figure out how.
Regards
The basic idea, if you want to plot using Hadley's ggplot2, is to get your data into the form:
x y
col_names values
This can be done using the melt function from Hadley's reshape2. Do ?melt to see the possible arguments. However, since here we want to melt the whole data.frame, we just need:
melt(all)
# this gives the data in format:
# variable value
# 1 STM_cold 6.0
# 2 STM_cold 6.0
# 3 STM_cold 5.9
# 4 STM_cold 6.1
# 5 STM_cold 5.5
# 6 STM_cold 5.6
Here, x will then be the variable column and y the corresponding value column.
require(ggplot2)
require(reshape2)
ggplot(data = melt(all), aes(x = variable, y = value)) +
  geom_point(aes(colour = variable))
If you don't want the colours, then just remove aes(colour=variable) inside geom_point so that it becomes geom_point().
Edit: I should probably mention here that you could also replace geom_point with geom_jitter, which will give you, well, jittered points:
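For example (a sketch; the width argument controls the horizontal spread of the jitter):
ggplot(data = melt(all), aes(x = variable, y = value)) +
  geom_jitter(aes(colour = variable), width = 0.15)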
Here are two options to consider. The first uses dotplot from the "lattice" package:
library(lattice)
dotplot(values ~ ind, data = stack(all))
The second uses dotchart from base R's "graphics" package. To use the dotchart function, you need to wrap your data.frame in as.matrix:
dotchart(as.matrix(all), labels = "")
Note that the points in this graphic are not "jittered", but rather, presented in the order they were recorded. That is to say, the lowest point is the first record, and the highest point is the last record. If you zoomed into the plot for this example, you would see that you have 16 very faint horizontal lines. Each line represents one row from each column. Thus, if you look at the dots for "STM_cold" or any of the other variables that have NA values, you'll see a few blank lines at the top where there was no data available.
This has its advantages since it might show a trend over time if the values are recorded chronologically, but might also be a disadvantage if there are too many rows in your source data frame.
A bit of a manual version using base R graphics just for fun.
Get the data:
test <- read.table(text="STM_cold STM_normal STM_hot PIC_cold PIC_normal PIC_hot
6.0 6.6 6.3 0.9 1.9 3.2
6.0 6.6 6.5 1.0 2.0 3.2
5.9 6.7 6.5 0.3 1.8 3.2
6.1 6.8 6.6 0.2 1.8 3.8
5.5 6.7 6.2 0.5 1.9 3.3
5.6 6.5 6.5 0.2 1.9 3.5
5.4 6.8 6.5 0.2 1.8 3.7
5.3 6.5 6.2 0.2 2.0 3.5
5.3 6.7 6.5 0.1 1.7 3.6
5.7 6.7 6.5 0.3 1.7 3.6
NA NA NA 0.1 1.8 3.8
NA NA NA 0.2 2.1 4.1
NA NA NA 0.2 1.8 3.3
NA NA NA 0.8 1.7 3.5
NA NA NA 1.7 1.6 4.0
NA NA NA 0.1 1.7 3.7",header=TRUE)
Set up the basic plot:
plot(
NA,
ylim=c(0,max(test,na.rm=TRUE)+0.3),
xlim=c(1-0.1,ncol(test)+0.1),
xaxt="n",
ann=FALSE,
panel.first=grid()
)
axis(1,at=seq_along(test),labels=names(test),lwd=0,lwd.ticks=1)
Plot some points, with some x-axis jittering so they are not printed on top of one another.
invisible(
mapply(
points,
jitter(rep(seq_along(test),each=nrow(test))),
unlist(test),
col=rep(seq_along(test),each=nrow(test)),
pch=19
)
)
Result: (plot omitted)
Edit: Here's an example using alpha transparency on the points and getting rid of the jitter, as discussed in the comments below with Ananda.
invisible(
mapply(
points,
rep(seq_along(test),each=nrow(test)),
unlist(test),
col=rgb(0,0,0,0.1),
pch=15,
cex=3
)
)
I asked a question like this before, but I decided to simplify my data format because I'm very new to R and didn't understand what was going on. Here's the link to that question: How to handle more than multiple sets of data in R programming?
But I edited what my data should look like and decided to leave it in this format:
X1.0 X X2.0 X.1
0.9 0.9 0.2 1.2
1.3 1.4 0.8 1.4
As you can see, I have four columns of data. The real data I'm dealing with has up to 2000 data points. Columns "X1.0" and "X2.0" are time, so what I want is the average of "X" and "X.1" every 100 seconds, based on my two time columns "X1.0" and "X2.0". I can do it using this command:
cuts <- cut(data$X1.0, breaks=seq(0, max(data$X1.0)+400, 400))
by(data$X, cuts, mean)
But this will only give me the average from one set of data, namely "X1.0" and "X". How can I do it so that I get averages from more than one data set? I also want to stop getting this kind of output:
cuts: (0,400]
[1] 0.7
------------------------------------------------------------
cuts: (400,800]
[1] 0.805
Note that the output above was done every 400 s. I really want a list of those cuts, i.e. the averages at the different intervals. Please help. I just used data = read.delim("clipboard") to get my data into the program.
It is a little bit confusing what output you want to get.
First I change the column names, but this is optional:
colnames(dat) <- c('t1', 'v1', 't2', 'v2')
Then I use ave, which is like by but with better-shaped output. I use a matrix as a trick to index the columns in pairs:
matrix(1:ncol(dat), ncol = 2)  ## matrix column 1 holds dat columns 1 and 2,
                               ## matrix column 2 holds dat columns 3 and 4
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Then I use this matrix with apply. Here is the entire solution:
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            function(x, by = 10) {        ## by = 10 seconds; replace this
                                          ## with 100 or 400 for your real data
              t.col <- dat[, x][, 1]      ## the time column (X1.0 / X2.0)
              v.col <- dat[, x][, 2]      ## the value column (X / X.1)
              ave(v.col,
                  cut(t.col, breaks = seq(0, max(t.col), by)),
                  FUN = mean)
            }))
EDIT: corrected the cut and simplified the code:
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            function(x, by = 10) ave(dat[, x][, 2], dat[, x][, 1] %/% by)))
   X1.0   X X2.0 X.1    1   2
1   0.9 0.9  0.2 1.2 1.38 1.3
2   1.3 1.4  0.8 1.4 1.38 1.3
3   2.0 1.7  1.6 1.1 1.38 1.3
4   2.6 1.9  2.2 1.6 1.38 1.3
5   9.7 1.0  2.8 1.3 1.38 1.3
6  10.7 0.8  3.5 1.1 1.65 1.3
7  11.6 1.5  4.1 1.8 1.65 1.3
8  12.1 1.4  4.7 1.2 1.65 1.3
9  12.6 1.8  5.4 1.2 1.65 1.3
10 13.2 2.1  6.3 1.3 1.65 1.3
11 13.7 1.6  6.9 1.1 1.65 1.3
12 14.2 2.2  9.4 1.3 1.65 1.3
13 14.6 1.8 10.0 1.5 1.65 1.5
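If you would rather have one row per interval than a column of repeated means, tapply gives a compact summary for each pair (a sketch, again with by = 10):
tapply(dat[, 2], dat[, 1] %/% 10, mean)  # averages of X within bins of X1.0
tapply(dat[, 4], dat[, 3] %/% 10, mean)  # averages of X.1 within bins of X2.0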