Positive and negative subsetting using dplyr::contains() and dplyr::select() in R

I'm trying to achieve positive subsetting specifically using a combination of dplyr::select() and dplyr::contains(), with the goal of subsetting by multiple string matches.
Minimal working example: when starting off with df1 and doing negative subsetting, I generate df2 as expected. In contrast, when attempting positive subsetting of df1, I generate df3 (no columns) when I'd have expected something like df4. Thanks for any help.
df1 <- data.frame("ppt_paint"=c(45,98,23),"het_heating"=c(1,1,2) ,"orm_wood"=c("QQ","OA","BB"), "hours"=c(4,6,4), "distance"=c(23,65,21))
df2 <- df1 %>% select(-contains("ppt_")) %>% select(-contains("het_")) %>% select(-contains("orm_"))
df3 <- df1 %>% select(contains("ppt_")) %>% select(contains("het_")) %>% select(contains("orm_"))
df4 <- data.frame("ppt_paint"=c(45,98,23),"het_heating"=c(1,1,2) ,"orm_wood"=c("QQ","OA","BB"))

Look at what df1 %>% select(contains("ppt_")) returns: as requested, it keeps only the column whose name contains "ppt_". The later select() calls cannot work as you expect because the other columns are no longer there, no matter what you pass to select().
You can keep the same idea but combine your three keys in a single select():
df1 %>% select(contains("ppt_"), contains("het_"), contains("orm_"))
ppt_paint het_heating orm_wood
1 45 1 QQ
2 98 1 OA
3 23 2 BB
Alternatively, you can use matches(), which accepts regular expressions:
df1 %>% select(matches("ppt_|het_|orm_"))
ppt_paint het_heating orm_wood
1 45 1 QQ
2 98 1 OA
3 23 2 BB
And by the way you can also use it to shorten your "negative" indexing:
df1 %>% select(-matches("ppt_|het_|orm_"))
hours distance
1 4 23
2 6 65
3 4 21
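As a further sketch of the same point: contains() also accepts a character vector and takes the union of the matches, so the three keys can go into one call. This assumes a reasonably recent dplyr/tidyselect (1.0.0 or later); the object names step1, step2, keep and drop are only illustrative. The first part shows why the chained version ends up with no columns.

```r
library(dplyr)

df1 <- data.frame(
  ppt_paint   = c(45, 98, 23),
  het_heating = c(1, 1, 2),
  orm_wood    = c("QQ", "OA", "BB"),
  hours       = c(4, 6, 4),
  distance    = c(23, 65, 21)
)

# Each select() only sees the columns that survived the previous one
step1 <- df1 %>% select(contains("ppt_"))
step2 <- step1 %>% select(contains("het_"))
names(step1)  # only ppt_paint is left
ncol(step2)   # nothing left to match "het_" against

# A character vector gives the union of the matches in a single select()
keep <- df1 %>% select(contains(c("ppt_", "het_", "orm_")))
drop <- df1 %>% select(-contains(c("ppt_", "het_", "orm_")))
names(keep)
names(drop)
```

Negative selections can be chained safely because each step only removes columns; positive selections cannot, because each step discards everything it did not match.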

Related

Filtering/subsetting R dataframe based on each rows n'th position value

I have a 'df' with 2 columns:
Combinations <- c(0011111111, 0011113111, 0013113112, 0022223114)
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)
I am trying to find a way to subset or filter the dataframe where the 7th, 8th, and 9th digits of the 'Combinations' column equal 311. For the example given, I would expect Combinations 0011113111, 0013113112, and 0022223114.
There are also instances where I would need to find different combinations, in different nth positions.
I know substring() can find these values for single rows but I'm not sure how to apply it to an entire dataframe.
substring() works with vectors as well.
subset(df, substring(Combinations, 7, 9) == "311")
# Combinations Values
#2 0011113111 2
#3 0013113112 3
#4 0022223114 4
data
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
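Since the question also asks about other digit patterns at other positions, here is a minimal sketch of how the substring() idea could be wrapped in a function. The helper name filter_by_position and its arguments are hypothetical, not part of any package. Note that the IDs have to be stored as character strings, as in the data above, because a numeric literal such as 0011111111 is parsed as the number 11111111 and loses its leading zeros.

```r
# Hypothetical helper: keep the rows where column `col` contains `pattern`
# starting at character position `start`
filter_by_position <- function(data, col, start, pattern) {
  stop_pos <- start + nchar(pattern) - 1
  data[substring(data[[col]], start, stop_pos) == pattern, , drop = FALSE]
}

Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1, 2, 3, 4)
df <- data.frame(Combinations, Values, stringsAsFactors = FALSE)

res_311 <- filter_by_position(df, "Combinations", 7, "311")  # digits 7 to 9
res_13  <- filter_by_position(df, "Combinations", 3, "13")   # digits 3 to 4
res_311
res_13
```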
Another base R idea:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
df[grep(pattern = "^[0-9]{6}311.$", df$Combinations), ]
Output:
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
As a tip, if you want to know more about regular expressions, this website helps me a lot: https://regexr.com/3elkd
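The regular expression can also be built from the position, which may help when the position changes from case to case. This is only a sketch: match_at is a hypothetical helper, and it assumes the pattern consists of plain digits (no regex metacharacters).

```r
# Hypothetical helper: TRUE where `pattern` starts at character `start` of x
match_at <- function(x, start, pattern) {
  # ".{n}" skips exactly n characters before the digits in `pattern`
  grepl(sprintf("^.{%d}%s", start - 1, pattern), x)
}

Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")

hit_311 <- match_at(Combinations, 7, "311")
hit_13  <- match_at(Combinations, 3, "13")
Combinations[hit_311]
Combinations[hit_13]
```

Unlike the pattern above, this version does not anchor the end of the string, so it also works for IDs of other lengths.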
Would this work?
library(dplyr)
library(stringr)
df %>% filter(str_sub(Combinations, 7,9) == 311)
Combinations Values
1 0011113111 2
2 0013113112 3
3 0022223114 4
Not pretty but works:
df[which(lapply(strsplit(df$Combinations, ""), function(x) which(x[7]==3 & x[8]==1 & x[9]==1))==1),]
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
Data:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)

Can I split character vector based on position in R?

This is my first post so all tips for posting are helpful :)
I want to merge two data frames with the same person-IDs, but the identifier is slightly (but systematically) different from one another.
in df A the ID is: <letter><3 digits>
in df B the ID is: <letter>-<3 digits>
See example below:
A_ID <- c("A123", "B213", "C421", "C312")
A_score <- c(8,10,9,10)
A <- data.frame(A_ID, A_score)
colnames(A) <- c("ID", "A_score")
B_ID <- c("A-123", "B-213", "C-421", "C-312")
B_score <- c(2,10,9,10)
B <- data.frame(B_ID, B_score)
colnames(B) <- c("ID", "B_score")
The problem is that, because of the - in the middle of the df B identifiers, these dfs won't merge.
What I want to achieve is to merge (full join) the dfs to form the columns: ID | A_score | B_score.
I tried converting the character vector to strings, splitting the ID (A) at character position 2 (after the letter), adding a -, then pasting and converting back to a character vector. But I feel this is probably not the most efficient or easiest way.
Thank you!
Try using gsub() to clean the ID variable in B and then merge the data frames in a single pipeline. Here is the code using tidyverse functions:
library(tidyverse)
#Code
NewA <- A %>% full_join(B %>% mutate(ID=gsub('-','',ID)))
Output:
ID A_score B_score
1 A123 8 2
2 B213 10 10
3 C421 9 9
4 C312 10 10
Use sub to get rid of the - and then merge:
B$ID <- sub("-", "", B$ID)
merge(A, B, "ID")
ID A_score B_score
1 A123 8 2
2 B213 10 10
3 C312 10 10
4 C421 9 9
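Two details may be worth a sketch. merge() performs an inner join by default, so a full join needs all = TRUE; and the opposite direction, inserting the hyphen by position, can be done with a capture group. The row "D-999" below is a made-up ID added only to show what happens to unmatched rows; it is not part of the original data.

```r
A <- data.frame(ID = c("A123", "B213", "C421", "C312"),
                A_score = c(8, 10, 9, 10),
                stringsAsFactors = FALSE)
# "D-999" is a hypothetical extra ID with no partner in A
B <- data.frame(ID = c("A-123", "B-213", "C-421", "D-999"),
                B_score = c(2, 10, 9, 7),
                stringsAsFactors = FALSE)

# Option 1: strip the hyphen from B's IDs
B_clean <- transform(B, ID = sub("-", "", ID, fixed = TRUE))

# Option 2: insert a hyphen after the first character of A's IDs
A_hyph <- transform(A, ID = sub("^(.)", "\\1-", ID))
A_hyph$ID

# all = TRUE keeps unmatched rows from both sides (full join)
full <- merge(A, B_clean, by = "ID", all = TRUE)
full
```

Rows without a partner get NA in the score column of the other data frame.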

R: a tidy way to count number of rows between pipes?

Using R and tidyr, I want to count the number of rows of a dataframe between two pipes; is there an elegant way to do this?
I ran into the problem when removing rows with only NA, then having to count the number of rows of the dataframe so far. Can I do this without storing the dataframe between the pipes?
Here is a reproducible example. I essentially need XXX to refer to the dataframe after drop_na().
library(dplyr)
library(tidyr)  # drop_na() comes from tidyr
scrap <- as.data.frame(matrix(1:16, ncol = 4))
scrap[4,] <- rep(NA, 4)
scrap %>%
  drop_na() %>%
  mutate(index=c(1:nrow(XXX)))
I thought it would guess what I am doing if I do not refer to anything as below, but no.
scrap %>%
  drop_na() %>%
  mutate(index=c(1:nrow()))
Error in nrow() : argument "x" is missing, with no default
Is there an elegant solution I am missing?
Not sure if I understood your question, but you can use dplyr::row_number():
library(dplyr)
library(tidyr)  # for drop_na()
scrap %>%
  drop_na() %>%
  mutate(index=row_number())
Returns:
V1 V2 V3 V4 index
1 1 5 9 13 1
2 2 6 10 14 2
3 3 7 11 15 3
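If the row count itself is needed (rather than a running index), two options seem to fit the XXX placeholder from the question. This is a sketch, assuming the magrittr pipe %>% as used above: inside dplyr verbs n() gives the number of rows in the current group, and `.` refers to the data frame that was piped in. The column name total_rows is only illustrative.

```r
library(dplyr)
library(tidyr)

scrap <- as.data.frame(matrix(1:16, ncol = 4))
scrap[4, ] <- rep(NA, 4)

# n() is the number of rows of the current group (here the whole data frame)
with_n <- scrap %>%
  drop_na() %>%
  mutate(index = seq_len(n()), total_rows = n())

# With the magrittr pipe, `.` stands for the data frame piped in so far
with_dot <- scrap %>%
  drop_na() %>%
  mutate(index = 1:nrow(.))

with_n
with_dot
```

The `.` placeholder is a feature of the magrittr pipe; the native |> pipe does not provide it, so n() is the more portable choice.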

Trying to avoid a for loop in r

I have some code that works but is very clunky and I'm sure there is a better way to do it, avoiding the for loop. Essentially I have a list of performances, and a list of factors. And I want to assign the highest performance to the highest factors, the lowest performance to the lowest factors, etc. Here is some simplified sample code:
#My simplified sample list of performances:
PerformanceList <- data.frame(v1 = c(rep(10,4)), v2 = c(rep(9,4)), v3 = c(rep(8,4)))
View(PerformanceList)
v1 v2 v3
1 10 9 8
2 10 9 8
3 10 9 8
4 10 9 8
#My simplified sample list of Factors:
MyFactors <- data.frame(v1 = c(35,25,15,5), v2 = c(10,20,60,20), v3 = c(5,10,15,40))
View(MyFactors)
v1 v2 v3
1 35 10 5
2 25 20 10
3 15 60 50
4 5 20 40
#Code to find the ranking of each row from largest to smallest:
Rankings <- data.frame(t(apply(-MyFactors, 1, rank, na.last="keep",ties.method="random")))
View(Rankings)
v1 v2 v3
1 1 2 3
2 1 2 3
3 3 1 2
4 3 2 1
Function to sort each row by ranking. I assume there is a better way to do this but I couldn't figure it out:
SortFunction <- function(RankingList){
  SortedRankings <- order(RankingList)
  return(SortedRankings)
}
#applying that Sort function to each row of the data frame:
SortedRankings <- data.frame(t(apply(Rankings, 1,SortFunction)))
View(SortedRankings)
X1 X2 X3
1 1 2 3
2 1 2 3
3 2 3 1
4 3 2 1
Here is a for loop that does what I want but I'm sure it's not the best way to do it. Basically I want to go down each row of my PerformanceList and choose the column that corresponds to the highest Ranking (which is column 1 from my Sorted Rankings above). I'd ideally like to then be able to assign column 2 from those Sorted Rankings to assign the second highest performance to my second highest factor, and so on...
FactorPerformanceList <- data.frame(matrix(NA, ncol=1, nrow=NROW(Rankings)))
for (i in 1:NROW(Rankings)){
  FactorPerformanceList[i,] <- PerformanceList[i,SortedRankings[i,1]]
}
View(FactorPerformanceList)
1 10
2 10
3 9
4 8
It seems like this should work but it gives a matrix of 4 rows by 4 columns instead:
FactorPerformanceList2 <- PerformanceList[,SortedRankings[,1]]
View(FactorPerformanceList2)
v1 v1 v2 v3
1 10 10 9 8
2 10 10 9 8
3 10 10 9 8
4 10 10 9 8
Any ideas or help would be greatly appreciated! Thank you!
This technically does not remove the for-loop, it just hides it. That said, it's a lot cleaner code than what you have, and unless you need all the intermediate data steps, it simplifies things greatly.
PerformanceList <- data.frame(
  v1 = c(rep(10,4)),
  v2 = c(rep(9,4)),
  v3 = c(rep(8,4))
)
MyFactors <- data.frame(
  v1 = c(35,25,15,5),
  v2 = c(10,20,60,20),
  v3 = c(5,10,15,40))
FactorPerformanceList <- as.data.frame(t(sapply(1:nrow(PerformanceList), function(i) {
  PerformanceList[i,order(MyFactors[i,])]
})))
The same code can be written with pipes:
library(tidyverse)
FactorPerformanceList <- 1:nrow(PerformanceList) %>%
  sapply(function(i) {
    PerformanceList[i,order(MyFactors[i,])]
  }) %>%
  t() %>%
  as.data.frame()
which makes the order of operations a little clearer (sapply, then t, then as.data.frame).
In general, for-loops can be avoided completely when you're working with columns, but row-wise operations aren't as easy to remove entirely. You can clean up the code by using the apply family of functions, or (if you want something fancier) the plyr or purrr packages.
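For this particular row-wise lookup, matrix indexing may remove the loop altogether. The attempt in the question, PerformanceList[,SortedRankings[,1]], returns four columns because a plain vector in the column position selects those columns for every row. Indexing a matrix with a two-column (row, column) matrix picks one cell per row instead. The sketch below uses the factor values from the code in the question; the names perf, fact, top_perf and sorted_perf are illustrative.

```r
perf <- as.matrix(data.frame(v1 = rep(10, 4), v2 = rep(9, 4), v3 = rep(8, 4)))
fact <- as.matrix(data.frame(v1 = c(35, 25, 15, 5),
                             v2 = c(10, 20, 60, 20),
                             v3 = c(5, 10, 15, 40)))

rows <- seq_len(nrow(perf))

# Column holding the largest factor in each row
top_col <- max.col(fact, ties.method = "first")

# A (row, column) index matrix selects exactly one cell per row
top_perf <- perf[cbind(rows, top_col)]
top_perf

# All ranks at once: column j holds the performance whose factor is the
# j-th largest in that row
ord <- t(apply(fact, 1, order, decreasing = TRUE))
sorted_perf <- matrix(perf[cbind(as.vector(row(ord)), as.vector(ord))],
                      nrow = nrow(perf))
sorted_perf
```

apply() still iterates over rows internally, so this is more compact rather than fundamentally loop-free; only the max.col() lookup is fully vectorised.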
Given the lack of clarity, I've come up with a somewhat flexible answer for you.
It might make sense to reshape each data.frame into long format while keeping the row and column index positions from the original structure, since those are what you would use to join the data.frames to one another.
I've chosen to use the tidyverse suite of packages to answer this, mainly dplyr.
Data
library(tidyverse)
PerformanceList <- data.frame(v1 = c(rep(10,4)), v2 = c(rep(9,4)), v3 = c(rep(8,4)))
MyFactors <- data.frame(v1 = c(35,25,15,5), v2 = c(10,20,60,20), v3 = c(5,10,15,40))
This function will take a data.frame and provide a long format data.frame with index position columns.
Function to convert to long data.frame with index ranks
df_ranks <- function(df) {
  names(df) <- 1:ncol(df)
  df %>%
    mutate(row_index = 1:nrow(.)) %>%
    gather(col_index, value, -row_index) %>%
    group_by(row_index) %>%
    mutate(row_rank = rank(value, na.last = "keep", ties.method = "random")) %>%
    group_by(col_index) %>%
    mutate(col_rank = rank(value, na.last = "keep", ties.method = "random")) %>%
    ungroup()
}
Applying the function to the data, and making sure to adjust column names will let us join without much hassle.
ranked_perf <- df_ranks(PerformanceList) %>% setNames(paste0("rank_", names(.)))
ranked_fact <- df_ranks(MyFactors) %>% setNames(paste0("fact_", names(.)))
We can then join the tables. It's important to understand what you want to do and what the expected result should be before this step. For this example, I've decided to match values within a column by their rank.
full_join(ranked_perf, ranked_fact,
by = c("rank_col_rank" = "fact_col_rank",
"rank_col_index" = "fact_col_index"))
What you do with this result is up to you: you can select columns and reshape it back to wide format using combinations of select, unite, and spread.

R sum of rows for different group of columns that start with similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset with likert scales and I want to row sum over different group of columns which share the first strings in their name.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
So, what I did, was to use the grep function in order to be able to sum the rows of the specified variables that started with certain strings and store them in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there is a lot more variables in the dataset and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same strings together and then apply the row function.
Thanks in advance!
One possible solution is to transpose df and calculate sums for the correct columns using base R rowsum function (using set.seed(123))
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
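To make the transpose trick easier to follow, here is a small deterministic sketch with made-up values. rowsum() adds up the rows of a matrix that share a group label, so transposing first turns "columns with the same prefix" into "rows with the same group", and transposing back restores the original orientation.

```r
# Made-up 2 x 6 example; column names follow the prefix_number scheme
m <- matrix(1:12, nrow = 2,
            dimnames = list(NULL, c("emp_1", "emp_2", "sat_1", "sat_2",
                                    "res_1", "res_2")))

# One label per column: everything after the underscore becomes "_t"
groups <- sub("_.*", "_t", colnames(m))
groups

# Sum rows of t(m) by label, then transpose back: one total column per prefix
totals <- t(rowsum(t(m), groups))
totals
```

rowsum() sorts the groups alphabetically by default, which is why the total columns in the output above do not follow the original column order.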
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick
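A short sketch with made-up values may clarify what happens in between. split.default() is used instead of split() because split() on a data frame dispatches to the data frame method, which splits rows; split.default() treats the data frame as a list and therefore splits its columns.

```r
# Made-up data with two prefixes
df <- data.frame(emp_1 = c(1, 2), emp_2 = c(3, 4),
                 sat_1 = c(5, 6), sat_2 = c(7, 8))

groups <- sub("_.*$", "_t", names(df))

# A list of data frames, one per prefix
pieces <- split.default(df, groups)
names(pieces)

# rowSums() on each piece gives one total column per prefix
totals <- sapply(pieces, rowSums)
result <- cbind(df, totals)
result
```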
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
  gather(key, value) %>%
  extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
  group_by(class) %>%
  summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
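Note that the result above has one total per class across all rows. If one total per row and class is wanted, as in the question, a row identifier has to be carried through the reshaping. The following is a sketch with made-up values, assuming tidyr 1.0.0 or later (for pivot_longer()) and dplyr 1.0.0 or later (for the .groups argument); the column name row is illustrative.

```r
library(dplyr)
library(tidyr)

# Made-up data with two prefixes
df <- data.frame(emp_1 = c(1, 2), emp_2 = c(3, 4),
                 sat_1 = c(5, 6), sat_2 = c(7, 8))

row_sums_long <- df %>%
  mutate(row = row_number()) %>%                      # remember the original row
  pivot_longer(-row, names_to = c("class", "id"), names_sep = "_") %>%
  group_by(row, class) %>%
  summarise(class_sum = sum(value), .groups = "drop")

row_sums_long
```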
Another potential solution is to use dplyr's rowwise() function (see https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/):
df %>%
  rowwise() %>%
  mutate(emp_sum = sum(c_across(starts_with("emp"))),
         sat_sum = sum(c_across(starts_with("sat"))),
         res_sum = sum(c_across(starts_with("res"))),
         com_sum = sum(c_across(starts_with("com"))),
         cap_sum = sum(c_across(starts_with("cap"))))
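A possible variant avoids rowwise() by handing the selected columns to rowSums(), which is vectorised over rows. This is a sketch with made-up values, assuming dplyr 1.0.0 or later; in dplyr 1.1.0 and later, pick() is the recommended spelling for selecting columns without applying a function, so across() here may need to be replaced by pick() in future versions.

```r
library(dplyr)

# Made-up data with two prefixes
df <- data.frame(emp_1 = c(1, 2), emp_2 = c(3, 4),
                 sat_1 = c(5, 6), sat_2 = c(7, 8))

totals <- df %>%
  mutate(emp_sum = rowSums(across(starts_with("emp_"))),
         sat_sum = rowSums(across(starts_with("sat_"))))

totals
```

This still needs one line per prefix, so for many prefixes the split.default() or rowsum() approaches above scale better.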

Resources