R: a tidy way to count the number of rows between pipes?

Using R and tidyr, I want to count the number of rows of a dataframe between two pipes; is there an elegant way to do this?
I ran into the problem when removing rows with only NA, then having to count the number of rows of the dataframe so far. Can I do this without storing the dataframe between the pipes?
Here is a reproducible example. I essentially need XXX to refer to the dataframe after drop_na().
library(dplyr)
library(tidyr)  # drop_na() comes from tidyr
scrap <- as.data.frame(matrix(1:16, ncol = 4))
scrap[4,] <- rep(NA, 4)
scrap %>%
drop_na() %>%
mutate(index=c(1:nrow(XXX)))
I thought it would guess what I am doing if I do not refer to anything as below, but no.
scrap %>%
drop_na() %>%
mutate(index=c(1:nrow()))
Error in nrow() : argument "x" is missing, with no default
Is there an elegant solution I am missing?

I'm not sure I understood your question, but you can use dplyr::row_number():
library(dplyr)
library(tidyr)  # for drop_na()
scrap %>%
drop_na() %>%
mutate(index=row_number())
Returns:
V1 V2 V3 V4 index
1 1 5 9 13 1
2 2 6 10 14 2
3 3 7 11 15 3
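If the row count itself is needed inside the pipe, rather than just a running index, two options that should work are dplyr's n(), which returns the number of rows in the current group, and the magrittr dot placeholder, which stands for the data frame arriving from the left of %>%. A minimal sketch on the same scrap data; the column names index_n and index_dot are made up for illustration:

```r
library(dplyr)
library(tidyr)

scrap <- as.data.frame(matrix(1:16, ncol = 4))
scrap[4, ] <- rep(NA, 4)

result <- scrap %>%
  drop_na() %>%
  mutate(
    index_n   = seq_len(n()),     # n() is the row count of the current data
    index_dot = seq_len(nrow(.))  # . is the data frame piped in from drop_na()
  )
```

seq_len() is used instead of 1:n so that an empty data frame yields an empty index rather than c(1, 0).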


tidyverse rename_with giving error when trying to provide new names based on existing column values

Assuming the following data set:
df <- data.frame(...1 = c(1, 2, 3),
...2 = c(1, 2, 3),
n_column = c(1, 1, 2))
I now want to rename all vars that start with "...". My real data sets could have different numbers of "..." vars. The information about how many such vars I have is in the n_column column, more precisely, it is the maximum of that column.
So I tried:
df %>%
rename_with(.cols = starts_with("..."),
.fn = paste0("new_name", 1:max(n_column)))
which gives an error:
# Error in paste0("new_name", 1:max(n_column)) :
# object 'n_column' not found
So I guess the problem is that the paste0 function does not look for the column I provide within the current data set. I'm not sure, though, how I could make it do so. Any ideas?
I know I could bypass the whole thing by just creating an external scalar that contains the max. of n_column, but ideally I'd like to do everything in one pipeline.
You don't need information from n_column, .cols will pass only those columns that satisfy the condition (starts_with("...")).
library(dplyr)
df %>% rename_with(~paste0("new_name", seq_along(.)), starts_with("..."))
# new_name1 new_name2 n_column
#1 1 1 1
#2 2 2 1
#3 3 3 2
This is also safer than using max(n_column): if the data in n_column gets corrupted, or the number of columns starting with ... changes, it will still work.
One way to refer to column values in rename_with is to use an anonymous function so that you can use .$n_column.
df %>%
rename_with(function(x) paste0("new_name", 1:max(.$n_column)),
starts_with("..."))
I am assuming this is part of longer chain so you don't want to use max(df$n_column).
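To see why seq_along(.) is enough in the first snippet of this answer: rename_with() calls the function with the selected column names as a character vector, so the length of that vector is already the number of matching columns. A small sketch with made-up column names (a_1, a_2, other) and a named function argument nms instead of the formula shorthand:

```r
library(dplyr)

# Hypothetical data: two columns to rename and one to leave alone
df <- data.frame(a_1 = 1:2, a_2 = 3:4, other = 5:6)

renamed <- df %>%
  rename_with(
    # nms is the character vector c("a_1", "a_2"), not the data frame
    function(nms) paste0("new_name", seq_along(nms)),
    starts_with("a_")
  )

names(renamed)
```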
We can use str_c
library(dplyr)
library(stringr)
df %>%
rename_with(~str_c("new_name", seq_along(.)), starts_with("..."))
Or using base R
i1 <- startsWith(names(df), "...")
names(df)[i1] <- sub("...", "new_name", names(df)[i1], fixed = TRUE)
df
new_name1 new_name2 n_column
1 1 1 1
2 2 2 1
3 3 3 2
A completely different approach would be
df %>% janitor::clean_names()
x1 x2 n_column
1 1 1 1
2 2 2 1
3 3 3 2

Trying to avoid a for loop in r

I have some code that works but is very clunky and I'm sure there is a better way to do it, avoiding the for loop. Essentially I have a list of performances, and a list of factors. And I want to assign the highest performance to the highest factors, the lowest performance to the lowest factors, etc. Here is some simplified sample code:
#My simplified sample list of performances:
PerformanceList <- data.frame(v1 = c(rep(10,4)), v2 = c(rep(9,4)), v3 = c(rep(8,4)))
View(PerformanceList)
v1 v2 v3
1 10 9 8
2 10 9 8
3 10 9 8
4 10 9 8
#My simplified sample list of Factors:
MyFactors <- data.frame(v1 = c(35,25,15,5), v2 = c(10,20,60,20), v3 = c(5,10,15,40))
View(MyFactors)
v1 v2 v3
1 35 10 5
2 25 20 10
3 15 60 15
4 5 20 40
#Code to find the ranking of each row from largest to smallest:
Rankings <- data.frame(t(apply(-MyFactors, 1, rank, na.last="keep",ties.method="random")))
View(Rankings)
v1 v2 v3
1 1 2 3
2 1 2 3
3 3 1 2
4 3 2 1
Function to sort each row by ranking. I assume there is a better way to do this but I couldn't figure it out:
SortFunction <- function(RankingList){
SortedRankings <- order(RankingList)
return(SortedRankings)
}
#applying that Sort function to each row of the data frame:
SortedRankings <- data.frame(t(apply(Rankings, 1,SortFunction)))
View(SortedRankings)
X1 X2 X3
1 1 2 3
2 1 2 3
3 2 3 1
4 3 2 1
Here is a for loop that does what I want but I'm sure it's not the best way to do it. Basically I want to go down each row of my PerformanceList and choose the column that corresponds to the highest Ranking (which is column 1 from my Sorted Rankings above). I'd ideally like to then be able to assign column 2 from those Sorted Rankings to assign the second highest performance to my second highest factor, and so on...
FactorPerformanceList <- data.frame(matrix(NA, ncol=1, nrow=NROW(Rankings)))
for (i in 1:NROW(Rankings)){
FactorPerformanceList[i,] <- PerformanceList[i,SortedRankings[i,1]]
}
View(FactorPerformanceList)
1 10
2 10
3 9
4 8
It seems like this should work but it gives a matrix of 4 rows by 4 columns instead:
FactorPerformanceList2 <- PerformanceList[,SortedRankings[,1]]
View(FactorPerformanceList2)
v1 v1 v2 v3
1 10 10 9 8
2 10 10 9 8
3 10 10 9 8
4 10 10 9 8
Any ideas or help would be greatly appreciated! Thank you!
This technically does not remove the for-loop, it just hides it. That said, it's a lot cleaner code than what you have, and unless you need all the intermediate data steps, it simplifies things greatly.
PerformanceList <- data.frame(
v1= c(rep(10,4)),
v2= c(rep(9,4)),
v3 = c(rep(8,4))
)
MyFactors <- data.frame(
v1 = c(35,25,15,5),
v2 = c(10,20,60,20),
v3 = c(5,10,15,40))
FactorPerformanceList <- as.data.frame(t(sapply(1:nrow(PerformanceList), function(i) {
PerformanceList[i, order(unlist(MyFactors[i, ]))]  # unlist() so order() gets a plain vector, not a data frame
})))
The same code can be written
library(tidyverse)
FactorPerformanceList <- 1:nrow(PerformanceList) %>%
sapply(function(i) {
PerformanceList[i, order(unlist(MyFactors[i, ]))]  # unlist() so order() gets a plain vector, not a data frame
}) %>%
t() %>%
as.data.frame()
which makes the order of operations a little clearer (sapply, then t, then as.data.frame).
In general, for-loops can be avoided completely when you're working with columns, but row-wise operations aren't as easy to remove entirely. You can clean up the code by using the apply family of functions, or (if you want something fancier) the plyr or purrr packages.
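For this particular task the per-row loop can probably be replaced with matrix indexing: indexing a matrix with a two-column (row, column) matrix picks one cell per pair. The sketch below assumes the goal is to reorder each row of PerformanceList from the highest factor to the lowest; col_order, idx and result are illustrative names, and ties are broken by column position rather than at random.

```r
PerformanceList <- data.frame(v1 = rep(10, 4), v2 = rep(9, 4), v3 = rep(8, 4))
MyFactors <- data.frame(v1 = c(35, 25, 15, 5),
                        v2 = c(10, 20, 60, 20),
                        v3 = c(5, 10, 15, 40))

perf <- as.matrix(PerformanceList)
fact <- as.matrix(MyFactors)

# For each row, the column positions ordered from largest to smallest factor
col_order <- t(apply(-fact, 1, order))

# Pair every cell of col_order with its row number: (row, column) lookups
idx <- cbind(as.vector(row(col_order)), as.vector(col_order))

# Column 1 holds the performance at the highest factor, column 2 the next, ...
result <- matrix(perf[idx], nrow = nrow(perf))
result[, 1]
```

There is still an apply() call to compute the ordering, but the lookup itself is a single vectorised operation.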
Given the lack of clarity I've come up with a somewhat flexible answer for you.
It might make sense to force a given data.frame into long format while keeping the index positions from the original structure, since these are what you would use to join the data.frames to one another.
I've chosen to use the tidyverse suite of packages to answer this, namely dplyr.
Data
library(tidyverse)
PerformanceList <- data.frame(v1 = c(rep(10,4)), v2 = c(rep(9,4)), v3 = c(rep(8,4)))
MyFactors <- data.frame(v1 = c(35,25,15,5), v2 = c(10,20,60,20), v3 = c(5,10,15,40))
This function will take a data.frame and provide a long format data.frame with index position columns.
Function to convert to long data.frame with index ranks
df_ranks <- function(df) {
names(df) <- 1:ncol(df)
df %>%
mutate(row_index = 1:nrow(.)) %>%
gather(col_index, value, -row_index) %>%
group_by(row_index) %>%
mutate(row_rank = rank(value, na.last = "keep", ties.method = "random")) %>%
group_by(col_index) %>%
mutate(col_rank = rank(value, na.last = "keep", ties.method = "random")) %>%
ungroup()
}
Applying the function to the data, and making sure to adjust column names will let us join without much hassle.
ranked_perf <- df_ranks(PerformanceList) %>% setNames(paste0("rank_", names(.)))
ranked_fact <- df_ranks(MyFactors) %>% setNames(paste0("fact_", names(.)))
We can then join the tables. It's important to understand what you want to do and what the expected result should be before this step. For this example, I've decided to match values within a column by their rank.
full_join(ranked_perf, ranked_fact,
by = c("rank_col_rank" = "fact_col_rank",
"rank_col_index" = "fact_col_index"))
What you do with this result is up to you; you can select columns and reshape it back to wide format using combinations of select, unite, and spread.

sapply results with dplyr

In the example below I am trying to determine which value is closest to each of the vals_int, by id. I can solve this problem using sapply() in a manner similar to the code below, but I am wondering whether the sapply() part can be done with another function in dplyr.
I am really just interested in if the sapply method and output can be reproduced using some function(s) in the dplyr package. I had thought that do() may work but am struggling to determine how.
library(tidyverse)
df <- tibble(  # data_frame() is deprecated in favour of tibble()
id = rep(1:10, 10) %>%
sort,
visit = rep(1:10, 10),
value = rnorm(100)
)
vals_int <- c(1, 2, 3)
tmp <- sapply(vals_int,
function(val_i) abs(df$value - val_i))
Yes, you can use the rowwise() and do() functions in dplyr to perform the same operation on every row, like so:
df %>% rowwise %>% do(diffs = abs(.$value - vals_int))
This will create a column called diffs in a new tibble which is a list of vectors with length 3. If you coerce the output that do() returns to be a data frame, it will instead create a tibble with three columns, one for each of the values subtracted.
df %>% rowwise %>% do(as.data.frame(t(abs(.$value - vals_int))))
The answer by @qdread does what you are looking for, but the tidyverse is starting to move away from the do() function (if that matters to you). Here is an alternative using map from the purrr package.
df %>%
  mutate(closest = map(value, function(x) {
    abs(x - vals_int) %>%
      t() %>%
      as.data.frame()  # a one-row data frame with columns V1, V2, V3
  })) %>%
  unnest(closest)  # name the list-column to unnest explicitly
That gives you this:
# A tibble: 100 x 6
id visit value V1 V2 V3
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.91813183 0.08186817 1.081868 2.081868
2 1 2 -1.68556173 2.68556173 3.685562 4.685562
3 1 3 -0.05984289 1.05984289 2.059843 3.059843
4 1 4 0.40128729 0.59871271 1.598713 2.598713
5 1 5 -0.09995526 1.09995526 2.099955 3.099955
6 1 6 0.81802663 0.18197337 1.181973 2.181973
7 1 7 -1.49244225 2.49244225 3.492442 4.492442
8 1 8 -0.74256185 1.74256185 2.742562 3.742562
9 1 9 -0.43943907 1.43943907 2.439439 3.439439
10 1 10 0.54985857 0.45014143 1.450141 2.450141
# ... with 90 more rows
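Since the sapply() call only builds a matrix of absolute differences, base R's outer() can produce the same matrix in one step, and dplyr::bind_cols() can attach it to the data. This is a sketch on a smaller made-up data set; the diff_ column names are an assumption for illustration, not something from the question.

```r
library(dplyr)

set.seed(1)
df <- tibble(
  id    = rep(1:3, each = 2),
  visit = rep(1:2, times = 3),
  value = rnorm(6)
)
vals_int <- c(1, 2, 3)

# One row per observation, one column per element of vals_int
diffs <- abs(outer(df$value, vals_int, "-"))
colnames(diffs) <- paste0("diff_", vals_int)

result <- bind_cols(df, as_tibble(diffs))

# The matrix matches what the original sapply() call returns
same_as_sapply <- all.equal(
  unname(diffs),
  sapply(vals_int, function(val_i) abs(df$value - val_i))
)
```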

Positive and negative subsetting using dplyr::contains() and dplyr::select() in R

I'm trying to achieve positive subsetting specifically using a combination of dplyr::select() and dplyr::contains(), with the goal being to subset by multiple string matches.
Minimal working example: when starting off with df1 and doing negative subsetting, I generate df2 as expected. In contrast, when attempting positive subsetting of df1, I generate df3 (no columns) when I'd have expected something like df4. Thanks for any help.
df1 <- data.frame("ppt_paint"=c(45,98,23),"het_heating"=c(1,1,2) ,"orm_wood"=c("QQ","OA","BB"), "hours"=c(4,6,4), "distance"=c(23,65,21))
df2 <- df1 %>% select(-contains("ppt_")) %>% select(-contains("het_")) %>% select(-contains("orm_"))
df3 <- df1 %>% select(contains("ppt_")) %>% select(contains("het_")) %>% select(contains("orm_"))
df4 <- data.frame("ppt_paint"=c(45,98,23),"het_heating"=c(1,1,2) ,"orm_wood"=c("QQ","OA","BB"))
Think about (and inspect the resulting data frame to see) what happens after df1 %>% select(contains("ppt_")). As asked, it keeps only the column whose name contains "ppt_". The later select() calls cannot work as you expect because the other columns are no longer there, whatever you pass to select().
You can keep the same idea but combine your three keys in a single select():
df1 %>% select(contains("ppt_"), contains("het_"), contains("orm_"))
ppt_paint het_heating orm_wood
1 45 1 QQ
2 98 1 OA
3 23 2 BB
Alternatively, you can do it with matches(), which accepts regular expressions:
df1 %>% select(matches("ppt_|het_|orm_"))
ppt_paint het_heating orm_wood
1 45 1 QQ
2 98 1 OA
3 23 2 BB
And by the way you can also use it to shorten your "negative" indexing:
df1 %>% select(-matches("ppt_|het_|orm_"))
hours distance
1 4 23
2 6 65
3 4 21
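The difference between chaining and combining can be checked directly. The sketch below also uses the fact that, in reasonably recent versions of tidyselect (an assumption about your installed version), contains() accepts a character vector and takes the union of the matches; chained and kept are illustrative names.

```r
library(dplyr)

df1 <- data.frame(
  ppt_paint = c(45, 98, 23), het_heating = c(1, 1, 2),
  orm_wood = c("QQ", "OA", "BB"), hours = c(4, 6, 4), distance = c(23, 65, 21)
)

# Chained positive selects: the second select() only sees ppt_paint,
# so nothing matches "het_" and no columns remain
chained <- df1 %>% select(contains("ppt_")) %>% select(contains("het_"))

# A character vector of literal strings selects the union in one step
kept <- df1 %>% select(contains(c("ppt_", "het_", "orm_")))

ncol(chained)
names(kept)
```

Unlike matches(), contains() treats its strings literally, so no regular-expression escaping is needed.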

R sum of rows for different group of columns that start with similar string

I'm quite new to R and this is the first time I dare to ask a question here.
I'm working with a dataset of Likert scales and I want to compute row sums over different groups of columns that share the first part of their names.
Below I constructed a data frame of only 2 rows to illustrate the approach I followed, though I would like to receive feedback on how I can write a more efficient way of doing it.
df <- as.data.frame(rbind(rep(sample(1:5),4),rep(sample(1:5),4)))
var.names <- c("emp_1","emp_2","emp_3","emp_4","sat_1","sat_2"
,"sat_3","res_1","res_2","res_3","res_4","com_1",
"com_2","com_3","com_4","com_5","cap_1","cap_2",
"cap_3","cap_4")
names(df) <- var.names
So, what I did, was to use the grep function in order to be able to sum the rows of the specified variables that started with certain strings and store them in a new variable. But I have to write a new line of code for each variable.
df$emp_t <- rowSums(df[, grep("\\bemp.", names(df))])
df$sat_t <- rowSums(df[, grep("\\bsat.", names(df))])
df$res_t <- rowSums(df[, grep("\\bres.", names(df))])
df$com_t <- rowSums(df[, grep("\\bcom.", names(df))])
df$cap_t <- rowSums(df[, grep("\\bcap.", names(df))])
But there are many more variables in the dataset, and I would like to know if there is a way to do this with only one line of code. For example, some way to group the variables that start with the same string together and then apply the row function.
Thanks in advance!
One possible solution is to transpose df and calculate sums for the correct columns using base R rowsum function (using set.seed(123))
cbind(df, t(rowsum(t(df), sub("_.*", "_t", names(df)))))
# emp_1 emp_2 emp_3 emp_4 sat_1 sat_2 sat_3 res_1 res_2 res_3 res_4 com_1 com_2 com_3 com_4 com_5 cap_1 cap_2 cap_3 cap_4 cap_t
# 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 2 4 5 3 1 13
# 2 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 1 3 4 2 5 14
# com_t emp_t res_t sat_t
# 1 15 14 11 7
# 2 15 10 12 9
Agree with MrFlick that you may want to put your data in long format (see reshape2, tidyr), but to answer your question:
cbind(
df,
sapply(split.default(df, sub("_.*$", "_t", names(df))), rowSums)
)
Will do the trick
You'll be better off in the long run if you put your data into tidy format. The problem is that the data is in a wide rather than a long format. And the variable names, e.g., emp_1, are actually two separate pieces of data: the class of the person, and the person's ID number (or something like that). Here is a solution to your problem with dplyr and tidyr.
library(dplyr)
library(tidyr)
df %>%
gather(key, value) %>%
extract(key, c("class", "id"), "([[:alnum:]]+)_([[:alnum:]]+)") %>%
group_by(class) %>%
summarize(class_sum = sum(value))
First we convert the data frame from wide to long format with gather(). Then we split the values emp_1 into separate columns class and id with extract(). Finally we group by the class and sum the values in each class. Result:
Source: local data frame [5 x 2]
class class_sum
1 cap 26
2 com 30
3 emp 23
4 res 22
5 sat 19
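Note that the pipeline above collapses all rows into one total per class. If per-row totals are wanted, as in the question, one possible sketch keeps a row identifier while reshaping, using pivot_longer() and pivot_wider() (the newer counterparts of gather() and spread()). The small data frame, the row column and the item column are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row example with two groups of columns
df <- data.frame(emp_1 = c(2, 1), emp_2 = c(4, 3),
                 sat_1 = c(1, 5), sat_2 = c(2, 1))

totals <- df %>%
  mutate(row = row_number()) %>%             # remember which row each value came from
  pivot_longer(-row,
               names_to = c("class", "item"),
               names_sep = "_") %>%          # "emp_1" becomes class "emp", item "1"
  group_by(row, class) %>%
  summarise(total = sum(value), .groups = "drop") %>%
  pivot_wider(names_from = class,
              values_from = total,
              names_glue = "{class}_t")      # back to one row per original row

totals
```

The result has one row per original row and one *_t column per prefix, so it can be bound back onto df with bind_cols() if needed.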
Another potential solution is dplyr's rowwise() function together with c_across(); see https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-rowwise/
df %>%
rowwise() %>%
mutate(emp_sum = sum(c_across(starts_with("emp"))),
sat_sum = sum(c_across(starts_with("sat"))),
res_sum = sum(c_across(starts_with("res"))),
com_sum = sum(c_across(starts_with("com"))),
cap_sum = sum(c_across(starts_with("cap"))))
