Add the sum of squares of all items in a column as a row in an R data frame

I have a data frame as shown below containing 3 rows (or, more generally, n rows). I want to add a 4th row (the n+1th row in general) containing the sum of squares of all items in each column.
x<-data.frame("a" = c(2,3,4),"b" =c(3,4,5))
> x
a b
1 2 3
2 3 4
3 4 5
In the above example, the 4th row should contain the values 29 and 50, respectively.

An option is
library(dplyr)
x %>%
summarise_all(~ sum(.^2)) %>%
bind_rows(x, .)
# a b
#1 2 3
#2 3 4
#3 4 5
#4 29 50
Or in base R
rbind(x, colSums(x^2))
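As a minimal sketch, the base R one-liner above can be wrapped in a small helper. The name add_sumsq_row is hypothetical (it is not part of the original answer), and the sketch assumes every column is numeric:

```r
# Hypothetical helper: append one row holding the column-wise sums of squares.
# Assumes all columns of d are numeric.
add_sumsq_row <- function(d) {
  stopifnot(all(vapply(d, is.numeric, logical(1))))
  rbind(d, colSums(d^2))
}

x <- data.frame(a = c(2, 3, 4), b = c(3, 4, 5))
res <- add_sumsq_row(x)
print(res)
```

The last row of res holds 29 and 50 for this input, matching the values asked for in the question.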

Related

How to rearrange columns of a data frame based on values in a row

This is an R programming question. I would like to rearrange the order of columns in a data frame based on the values in one of the rows. Here is an example data frame:
df <- data.frame(A=c(1,2,3,4),B=c(3,2,4,1),C=c(2,1,4,3),
D=c(4,2,3,1),E=c(4,3,2,1))
Suppose I want to rearrange the columns in df based on the values in row 4, ascending from 1 to 4, with ties having the same rank. So the desired data frame could be:
df <- data.frame(B=c(3,2,4,1),D=c(4,2,3,1),E=c(4,3,2,1),
C=c(2,1,4,3),A=c(1,2,3,4))
although I am indifferent about the order of first three columns, all of which have the value 1 in column 4.
I could do this with a for loop, but I am looking for a simpler approach. Thank you.
We can use select: subset row 4, unlist it, order the values, and pass the result to select.
library(dplyr)
df %>%
select(order(unlist(.[4, ])))
-output
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
Or we may use the following, where flatten_dbl comes from purrr:
library(purrr)
df %>%
select({.} %>%
slice_tail(n = 1) %>%
flatten_dbl %>%
order)
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
Or in base R, placing the ordering in the column position of the index:
df[, order(unlist(tail(df, 1)))]
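A short base R sketch of the same idea, written out step by step. The intermediate names key and reordered are illustrative only. The ordering has to go in the column position of the index, and ties keep their original left-to-right order because order() is stable by default:

```r
df <- data.frame(A = c(1, 2, 3, 4), B = c(3, 2, 4, 1), C = c(2, 1, 4, 3),
                 D = c(4, 2, 3, 1), E = c(4, 3, 2, 1))

# Values of the last row as a named numeric vector
key <- unlist(df[nrow(df), ])

# Reorder the columns (not the rows) by those values
reordered <- df[, order(key), drop = FALSE]
print(names(reordered))
```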

How to subset a data.frame according to the values of last two rows?

###the original data
df1 <- data.frame(a=c(2,2,5,5,7), b=c(1,5,4,7,6))
df2 <- data.frame(a=c(2,2,5,5,7,7), b=c(1,5,4,7,6,3))
When the a column values of the last two rows are not equal (here the 4th row differs from the 5th row, namely 5 != 7), I want to subset the last row only.
#input
> df1
a b
1 2 1
2 2 5
3 5 4
4 5 7
5 7 6
#output
> df1
a b
1 7 6
When the a column values of the last two rows are equal (here the 5th row equals the 6th row, namely 7 == 7), I want to subset the last two rows.
#input
> df2
a b
1 2 1
2 2 5
3 5 4
4 5 7
5 7 6
6 7 3
#output
> df2
a b
1 7 6
2 7 3
You can write a function to check the last two row values of a column:
return_rows <- function(data) {
n <- nrow(data)
if(data$a[n] == data$a[n - 1])
tail(data, 2)
else tail(data, 1)
}
return_rows(df1)
# a b
#5 7 6
return_rows(df2)
# a b
#5 7 6
#6 7 3
Try it this way, applying the same filter to each data frame:
library(tidyverse)
df1 %>%
filter(a == last(a))
a b
5 7 6
df2 %>%
filter(a == last(a))
a b
5 7 6
6 7 3
We can use subset from base R
subset(df1, a == a[length(a)])
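One caveat worth illustrating: matching on the last value of a keeps every row with that value, not just the last one or two rows. The sketch below uses a made-up data frame, df3 (not from the question), in which the last value also appears in the first row, and compares the result with the return_rows function shown earlier:

```r
# Hypothetical data: the last value of a (7) also occurs in row 1
df3 <- data.frame(a = c(7, 2, 5, 7), b = c(1, 5, 4, 7))

# Same logic as the return_rows answer; assumes at least two rows
return_rows <- function(data) {
  n <- nrow(data)
  if (data$a[n] == data$a[n - 1]) tail(data, 2) else tail(data, 1)
}

by_value <- subset(df3, a == a[length(a)])  # keeps rows 1 and 4
by_position <- return_rows(df3)             # keeps row 4 only
print(by_value)
print(by_position)
```

The two approaches agree on df1 and df2 because equal values there are adjacent at the end of the column.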

Combining two columns using shared values in first column

I am trying to adjust the formatting of a data set. My current set looks like this, in two columns. The first column is a "cluster" and the second column "name" contains values within each cluster:
Cluster Name
A 1
A 2
A 3
B 4
B 5
C 2
C 6
C 7
And I'd like a single-column list in which all the values from column 2 are listed under the associated cluster from column 1:
Cluster A
1
2
3
Cluster B
4
5
Cluster C
2
6
7
I've been trying in R and Excel with no luck for the last few hours. Any ideas?
Using a trick with tidyr::nest :
library(dplyr)
library(tidyr)
df %>% mutate(Cluster = paste0("Cluster_",Cluster)) %>% nest(Name) %>% t %>% unlist %>% as.data.frame
# .
# 1 Cluster_A
# 2 1
# 3 2
# 4 3
# 5 Cluster_B
# 6 4
# 7 5
# 8 Cluster_C
# 9 2
# 10 6
# 11 7
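For comparison, here is a base R sketch of the same reshaping that avoids nest. It assumes the input is a data frame named df with columns Cluster and Name as in the question; parts and out are illustrative names:

```r
df <- data.frame(Cluster = c("A", "A", "A", "B", "B", "C", "C", "C"),
                 Name = c(1, 2, 3, 4, 5, 2, 6, 7))

# Split the names by cluster, then put a header entry in front of each group
parts <- split(df$Name, df$Cluster)
out <- unlist(Map(function(cl, vals) c(paste("Cluster", cl), vals),
                  names(parts), parts),
              use.names = FALSE)

print(data.frame(value = out))
```

Because headers and values share one column, everything is stored as character.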

How to find first occurrence of a vector of numeric elements within a data frame column?

I have a data frame (min_set_obs) which contains two columns: the first containing numeric values, called Treatment, and the second an id column called seq:
min_set_obs
Treatment seq
1 29
1 23
3 60
1 6
2 41
1 5
2 44
Let's say I have a vector of numeric values, called key:
key
[1] 1 1 1 2 2 3
I.e. a vector of three 1s, two 2s, and one 3.
How would I go about identifying which rows from my min_set_obs data frame contain the first occurrence of values from the key vector?
I'd like my output to look like this:
Treatment seq
1 29
1 23
3 60
1 6
2 41
2 44
I.e. the sixth row from min_set_obs was 'extra' (it was the fourth 1 when there should only be three 1s), so it would be removed.
I'm familiar with the %in% operator, but I don't think it can tell me the position of the first occurrence of the key vector in the first column of the min_set_obs data frame.
Thanks
Here is an option with base R, where we split the 'min_set_obs' by 'Treatment' into a list, get the head of elements in the list using the corresponding frequency of 'key' and rbind the list elements to a single data.frame
res <- do.call(rbind, Map(head, split(min_set_obs, min_set_obs$Treatment), n = table(key)))
row.names(res) <- NULL
res
# Treatment seq
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
Using dplyr, you can first count the keys with table and then take the corresponding top n rows from each group:
library(dplyr)
m <- table(key)
min_set_obs %>% group_by(Treatment) %>% do({
# as.character(.$Treatment[1]) returns the treatment for the current group
# use coalesce to get the default number of rows (0) if the treatment doesn't exist in key
head(., coalesce(m[as.character(.$Treatment[1])], 0L))
})
# A tibble: 6 x 2
# Groups: Treatment [3]
# Treatment seq
# <int> <int>
#1 1 29
#2 1 23
#3 1 6
#4 2 41
#5 2 44
#6 3 60
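Both answers above return the rows grouped by Treatment, whereas the desired output in the question keeps the original row order. A base R sketch that preserves that order follows; occ, counts, and allowed are illustrative names, and the approach is an alternative rather than either answerer's method:

```r
min_set_obs <- data.frame(Treatment = c(1, 1, 3, 1, 2, 1, 2),
                          seq = c(29, 23, 60, 6, 41, 5, 44))
key <- c(1, 1, 1, 2, 2, 3)

# Running occurrence number of each Treatment value (1st, 2nd, ... time seen)
occ <- ave(seq_along(min_set_obs$Treatment), min_set_obs$Treatment,
           FUN = seq_along)

# How many rows each Treatment may keep, looked up from the key counts
counts <- table(key)
allowed <- as.integer(counts)[match(as.character(min_set_obs$Treatment),
                                    names(counts))]
allowed[is.na(allowed)] <- 0L  # treatments absent from key keep no rows

res <- min_set_obs[occ <= allowed, ]
print(res)
```

Here the sixth row (the fourth occurrence of Treatment 1) is dropped and the remaining rows stay in their original positions.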

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
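If your data is truly a matrix and not a data.frame, you can work around this too. A matrix has no $ accessor and, without row names, cannot be indexed by character, so select the ID column by name and convert the sampled row numbers back to integers:
dat[as.integer(tapply(as.character(seq_len(nrow(dat))), dat[, "ID"], FUN=sample, 1)), ]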
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]
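A further base R sketch sidesteps the single-value pitfall of sample by drawing a position with sample.int instead of sampling the row numbers directly. The seed is arbitrary and only makes the run repeatable; idx and picked are illustrative names:

```r
set.seed(1)  # arbitrary seed, only for reproducibility

dat <- data.frame(ID = c(1, 2, 2, 3, 4, 4), Value = c(10, 5, 8, 15, 7, 9))

# For each ID, pick one of its row numbers; sample.int(length(i), 1) is safe
# even when the group has a single row
idx <- vapply(split(seq_len(nrow(dat)), dat$ID),
              function(i) i[sample.int(length(i), 1)],
              integer(1))

picked <- dat[idx, ]
print(picked)
```

Every ID appears exactly once in picked, and which of the duplicated rows is chosen depends on the seed.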
