How to calculate the mean for specific columns in R?

The goal: I want to create two new columns in R:
one column showing the mean of each row, computed only over the columns whose names do not contain the string "_X", and
one column showing the mean of each row, computed only over the columns whose names do contain the string "_X".
For example:
phone1 phone1_X phone2 phone2_X phone3 phone3_X
1 2 3 4 5 6
2 4 6 8 10 12
Results:
Mean_of_none_X
3 (1+3+5)/3
6 (2+6+10)/3
Mean_of_X
4
8
Thank you!
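
For reference, the example data can be rebuilt like this (the name df is an assumption, matching what the answers below use):
df <- data.frame(phone1   = c(1, 2),  phone1_X = c(2, 4),
                 phone2   = c(3, 6),  phone2_X = c(4, 8),
                 phone3   = c(5, 10), phone3_X = c(6, 12))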

Try using rowMeans and grep over the column names to include/exclude certain columns:
# only "_x"
rowMeans(df[,grep("_x",colnames(df))])
# No "_x"
rowMeans(df[,-grep("_x",colnames(df))])
Output:
#> # only "_x"
#> rowMeans(df[,grep("_x",colnames(df))])
#[1] 4 8
#> # No "_x"
#> rowMeans(df[,-grep("_x",colnames(df))])
#[1] 3 6
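
To produce the two new columns the question asks for, both calls can be assigned back to the data frame; a minimal sketch using the column names from the question (Mean_of_X, Mean_of_none_X):
df$Mean_of_X      <- rowMeans(df[, grep("_X", colnames(df))])
df$Mean_of_none_X <- rowMeans(df[, -grep("_X", colnames(df))])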

Try this
> lapply(split.default(df, endsWith(names(df), "_X")), rowMeans)
$`FALSE`
[1] 3 6
$`TRUE`
[1] 4 8

library(dplyr)
df %>%
  rowwise() %>%
  mutate(x_mean = mean(c_across(contains('_X'))),
         notx_mean = mean(c_across(!contains('_X') & !contains('_mean'))))
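
If rowwise() is slow on a large data frame, a vectorised sketch with across() (dplyr 1.0+) computes the same means without per-row iteration; the column-exclusion logic mirrors the answer above:
library(dplyr)
df %>%
  mutate(x_mean    = rowMeans(across(contains('_X'))),
         notx_mean = rowMeans(across(!contains('_X') & !contains('_mean'))))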

Related

create a data-frame with the new column values being the names of a list

Given the following unequal list :
lst <- list("es1-7"= c(1,2,3,4), "sa1-12"=c(3,4) , "ES8-13"= c(9,7,4,1,5,2))
> lst
$`es1-7`
[1] 1 2 3 4
$`sa1-12`
[1] 3 4
$`ES8-13`
[1] 9 7 4 1 5 2
I would like to create a data-frame like this:
group numbers
1 es1-7 1
2 es1-7 2
3 es1-7 3
4 es1-7 4
5 sa1-12 3
6 ... ...
So in this case the names of the list will become the values of a new column called group, and numbers will hold the values of the list.
Solutions using base R and dplyr are more than welcome.
We can use stack from base R to create a two-column data.frame
stack(lst)[2:1]
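If the columns should be named group and numbers as in the desired output, the stacked result can be renamed in the same step; a small sketch using setNames():
setNames(stack(lst)[2:1], c("group", "numbers"))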
Or with enframe
library(tidyverse)
enframe(lst, name = "group", value = "numbers") %>%
  unnest(numbers)
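
Another base R sketch that builds the same two-column data frame directly with rep() and lengths():
data.frame(group   = rep(names(lst), lengths(lst)),
           numbers = unlist(lst, use.names = FALSE))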

How to check if rows in one column present in another column in R

I have a data set = data1 with id and emails as follows:
id emails
1 A,B,C,D,E
2 F,G,H,A,C,D
3 I,K,L,T
4 S,V,F,R,D,S,W,A
5 P,A,L,S
6 Q,W,E,R,F
7 S,D,F,E,Q
8 Z,A,D,E,F,R
9 X,C,F,G,H
10 A,V,D,S,C,E
I have another data set = data2 with check_email as follows:
check_email
A
D
S
V
I want to check whether each check_email value from data2 appears in the emails column of data1, and keep only those id values from data1 whose emails contain at least one of the check_email values.
My desired output will be:
id
1
2
4
5
7
8
10
I have written code using a for loop, but it is taking forever because my actual dataset is very large.
Any advice would be highly appreciated!
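For a reproducible setup, the two data sets can be rebuilt roughly like this (assuming emails is stored as a plain character column):
data1 <- data.frame(id = 1:10,
                    emails = c("A,B,C,D,E", "F,G,H,A,C,D", "I,K,L,T",
                               "S,V,F,R,D,S,W,A", "P,A,L,S", "Q,W,E,R,F",
                               "S,D,F,E,Q", "Z,A,D,E,F,R", "X,C,F,G,H",
                               "A,V,D,S,C,E"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(check_email = c("A", "D", "S", "V"),
                    stringsAsFactors = FALSE)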
You can use a regular expression to subset your data. First collapse everything into one pattern:
paste(data2$check_email, collapse = "|")
# [1] "A|D|S|V"
Then find the indices of the rows whose emails match the pattern:
grep(paste(data2$check_email, collapse = "|"), data1$emails)
# [1] 1 2 4 5 7 8 10
And then combine everything:
data1[grep(paste(data2$check_email, collapse = "|"), data1$emails), ]
# id emails
# 1 1 A,B,C,D,E
# 2 2 F,G,H,A,C,D
# 3 4 S,V,F,R,D,S,W,A
# 4 5 P,A,L,S
# 5 7 S,D,F,E,Q
# 6 8 Z,A,D,E,F,R
# 7 10 A,V,D,S,C,E
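One caveat worth noting (not a problem with the single-letter toy data): plain substring matching can hit partial matches with real email addresses. A more defensive sketch anchors each value between commas or string boundaries; for addresses containing regex metacharacters such as ".", the strsplit approach further below avoids regular expressions entirely:
pattern <- paste0("(^|,)(", paste(data2$check_email, collapse = "|"), ")(,|$)")
data1[grep(pattern, data1$emails), ]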
data1[rowSums(sapply(data2$check_email, function(x) grepl(x, data1$emails))) > 0, "id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
We can split the character vector as.character(data1$emails) into substrings, then iterate over the resulting list with sapply, checking whether any of the substrings is contained in data2$check_email. Finally, we extract the matching rows from data1:
> emails <- strsplit(as.character(data1$emails), ",")
> ind <- sapply(emails, function(emails) any(emails %in% as.character(data2$check_email)))
> data1[ind,"id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10

Combining two columns using shared values in first column

I am trying to adjust the formatting of a data set. My current set has two columns: the first column is a "cluster" and the second column, "name", contains the values within each cluster:
Cluster Name
A 1
A 2
A 3
B 4
B 5
C 2
C 6
C 7
And I'd like a single column in which all the values from column 2 are listed under their associated cluster from column 1:
Cluster A
1
2
3
Cluster B
4
5
Cluster C
2
6
7
I've been trying in R and Excel with no luck for the last few hours. Any ideas?
Using a trick with tidyr::nest:
library(dplyr)
library(tidyr)
df %>%
  mutate(Cluster = paste0("Cluster_", Cluster)) %>%
  nest(Name) %>%
  t %>%
  unlist %>%
  as.data.frame
# .
# 1 Cluster_A
# 2 1
# 3 2
# 4 3
# 5 Cluster_B
# 6 4
# 7 5
# 8 Cluster_C
# 9 2
# 10 6
# 11 7
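
A base R sketch of the same reshaping without tidyr: split Name by Cluster and interleave a header row per cluster (the "Cluster A" header text follows the desired output above):
groups <- split(df$Name, df$Cluster)
out <- unlist(mapply(function(cl, vals) c(paste("Cluster", cl), vals),
                     names(groups), groups, SIMPLIFY = FALSE),
              use.names = FALSE)
data.frame(out)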

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample in which every ID is represented once and only once.
That means ID 1 will be chosen, then one of the two rows with ID 2, then ID 3, then one of the two rows with ID 4, and so on.
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling each subset separately.
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))),dat$ID,FUN=sample,1),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
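The usual work-around when the vector being sampled may have length one is to sample positions rather than values; a small sketch of that idiom:
x <- 4
x[sample(length(x), 1)]
#[1] 4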
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
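In dplyr 1.0 and later, sample_n() is superseded by slice_sample(); an equivalent sketch:
library(dplyr)
df %>% group_by(ID) %>% slice_sample(n = 1) %>% ungroup()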
The idea is to reorder the rows randomly and then remove duplicates in that order:
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

How to pass column name as parameter to function in dplyr?

I want to do the same as here but with dplyr and one more column.
I want to select a column via a string variable, but on top of that I also want to select a second column normally.
I need this because I have a function which selects a couple of columns based on given parameters.
I have the following code as an example:
library(dplyr)
data(cars)
x <- "speed"
cars %>% select_(x, dist)
You can use quote() for the dist column
x <- "speed"
cars %>% select_(x, quote(dist)) %>% head
# speed dist
# 1 4 2
# 2 4 10
# 3 7 4
# 4 7 22
# 5 8 16
# 6 9 10
I know I'm a little late to this one, but I figured I would add it for others.
x <- "speed"
cars %>% select(one_of(x), dist) %>% head()
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
OR this would work too
cars %>% select(one_of(c(x,'dist')))
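
In recent dplyr versions the underscore verbs such as select_() are deprecated and one_of() is superseded; a sketch of the same selection with all_of() (dplyr 1.0+):
library(dplyr)
x <- "speed"
cars %>% select(all_of(x), dist) %>% head()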
