R rowSums and select inside mutate function dplyr - r

I am trying to select columns in my dataframe and add them together if they exist. I tried
newdf4 %>% select(any_of(contains(c("adSize_300 x 600","adSize_160 x 600","adSize_120 x 600","adSize_125 x 600")))) %>% mutate(vertical_sizes=rowSums(.))
It gives me a separate output of the two columns that exist and the new vertical_sizes column created:
adSize_300 x 600 adSize_160 x 600 vertical_sizes
1 1 0 1
2 0 0 0
3 0 0 0
4 0 0 0
5 0 1 1
6 0 0 0
7 1 0 1
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
However, I want the new vertical sizes column to be added to my original newdf4 dataframe.
I am trying:
newdf4 %>% mutate(vertical_sizes = rowSums(select(any_of(contains(c("adSize_300 x 600","adSize_160 x 600","adSize_120 x 600","adSize_125 x 600"))))))
But I receive this error:
Error in `mutate_cols()`:
! Problem with `mutate()` column `vertical_sizes`.
i `vertical_sizes = rowSums(...)`.
x `any_of()` must be used within a *selecting* function.
i See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>.
Caused by error:
! `any_of()` must be used within a *selecting* function.
i See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>.
Run `rlang::last_error()` to see where the error occurred.
Please let me know if there is another approach to this. Thank you!

select inside of mutate requires the data as its first argument; it will not infer or assume that it should look in the enclosing environment. For instance, when one does something like
mtcars %>%
select(cyl, mpg, disp)
The %>% is placing the data in the first argument of the RHS function, so this is more-aptly shown as
mtcars %>%
select(., cyl, mpg, disp)
where . indicates where the data is going. %>% allows one to specify where in the RHS expression the data should be placed (generally it works so long as the . is not placed within nested expressions).
Having said that, your use of select places the any_of in the first argument, which is (obviously) not going to work.
I suggest we use cur_data() (one of dplyr's context-dependent expressions) as the first argument.
Data:
newdf4 <- read.table(text = '
adSize_300x600 adSize_160x600 vertical_sizes
1 1 0 1
2 0 0 0
3 0 0 0
4 0 0 0
5 0 1 1
6 0 0 0
7 1 0 1
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0')
Code:
newdf4 %>%
  mutate(
    vertical_sizes = rowSums(
      select(cur_data(),
             contains(c("adSize_300x600", "adSize_160x600", "adSize_120x600", "adSize_125x600")))
    )
  )
# adSize_300x600 adSize_160x600 vertical_sizes
# 1 1 0 1
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 0 1 1
# 6 0 0 0
# 7 1 0 1
# 8 0 0 0
# 9 0 0 0
# 10 0 0 0
# 11 0 0 0
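In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(), which takes the column selection directly. The following is a minimal sketch under that assumption; the small data frame, the extra column other, and the helper vector wanted are made up for illustration. any_of() silently skips names that are not present in the data, which matches the "add them together if they exist" requirement.

```r
library(dplyr)

# Toy data (hypothetical): only two of the four wanted columns exist
newdf4 <- data.frame(
  adSize_300x600 = c(1, 0, 0),
  adSize_160x600 = c(0, 0, 1),
  other          = c(5, 6, 7)
)

# Column names we would like to sum; missing ones are ignored by any_of()
wanted <- c("adSize_300x600", "adSize_160x600",
            "adSize_120x600", "adSize_125x600")

# pick() works inside mutate() and returns the selected columns as a tibble
result <- newdf4 %>%
  mutate(vertical_sizes = rowSums(pick(any_of(wanted))))

result
```

On older dplyr versions, the cur_data() form shown above remains the option to use.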

Related

mlogit gives error: the two indexes don't define unique observations

My dataframe named longData looks like:
ID Set Choice Apple Microsoft IBM Google Intel HewlettPackard Sony Dell Yahoo Nokia
1 1 1 0 1 0 0 0 0 0 0 0 0 0
2 1 2 0 0 1 0 0 0 0 0 0 0 0
3 1 3 0 0 0 1 0 0 0 0 0 0 0
4 1 4 1 0 0 0 1 0 0 0 0 0 0
5 1 5 0 0 0 0 0 0 0 0 0 0 1
6 1 6 0 -1 0 0 0 0 0 0 0 0 0
I am trying to run mlogit on it by:
logitModel = mlogit(Choice ~ Apple+Microsoft+IBM+Google+Intel+HewlettPackard+Sony+Dell+Yahoo+Nokia | 0, data = longData, shape = "long")
it gives the following error:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
After looking for some time, I found that this error is raised by dfidx, as seen in its source code:
z <- data[, c(posid1[1], posid2[1])]
if (nrow(z) != nrow(unique(z)))
stop("the two indexes don't define unique observations")
But calling the following code runs without the error and returns the two index columns that should uniquely identify a row in the data frame:
dfidx(longData)$idx
this gives expected output as:
~~~ indexes ~~~~
ID Set
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
8 1 8
9 1 9
10 1 10
indexes: 1, 2
So what am I doing wrong? I looked at some related questions but couldn't find what I am missing.
It looks like your example comes from here: https://docs.displayr.com/wiki/MaxDiff_Analysis_Case_Study_Using_R
The code seems outdated, I remember it worked for me, but not anymore.
The error message is valid because every pair (ID, Set) appears several times, once for each alternative.
However this works:
# mlogit otherwise complains that Choice cannot be coerced to logical
longData$Choice <- as.logical(longData$Choice)
# number of alternatives per set (5 in this example)
nAltsPerSet <- 5
# create the alternative number within each set
longData$Alternative <- 1 + (0:(nrow(longData) - 1) %% nAltsPerSet)
# define the dataset
mdata <- mlogit.data(data = longData, shape = "long", choice = "Choice", alt.var = "Alternative", id.var = "ID")
# fit the model
logitModel <- mlogit(Choice ~ Microsoft + IBM + Google + Intel + HewlettPackard + Sony + Dell + Yahoo + Nokia | 0,
                     data = mdata)
summary(logitModel)
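To see why the extra Alternative column resolves the uniqueness complaint, here is a small base-R sketch on made-up data (the toy data frame is hypothetical and much smaller than longData). It applies the same kind of uniqueness check that the dfidx error is based on, before and after adding the alternative number.

```r
# Toy long-format data: one respondent, two sets, five alternatives per set
nAltsPerSet <- 5
toy <- data.frame(ID = rep(1, 10), Set = rep(1:2, each = nAltsPerSet))

# (ID, Set) alone repeats once per alternative, so it is not a unique key
unique_before <- nrow(unique(toy[, c("ID", "Set")])) == nrow(toy)

# Number the alternatives 1..nAltsPerSet within each consecutive block of rows
toy$Alternative <- 1 + (0:(nrow(toy) - 1) %% nAltsPerSet)

# (ID, Set, Alternative) now identifies every row
unique_after <- nrow(unique(toy[, c("ID", "Set", "Alternative")])) == nrow(toy)

unique_before  # FALSE
unique_after   # TRUE
```

This numbering assumes the rows are already ordered so that each set's alternatives sit in consecutive rows.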

Calling groups of string digits using grepl

I'm using the grepl function to try and sort through data; all the row numbers are different survey respondents, and each number in the "ANI_type" string represents a different type of animal - I need to sort these depending on animal type. More specifically, I need to group some of the digits in the strings into categories. For example, digits 6,7,8,9,10,11 all need to be placed in the animals$pock object. How would I go about that using the grep function?
> animals$dogs <- as.numeric(grepl("\\b1\\b", animals$ANI_type))
> animals
ANI_type dogs cats repamp
1 1,2,5,12,13,14,15,16,18,19,27 1 1 0
2 2 0 1 0
3 20,21,22,23,26 0 0 0
4 20,21,22,23 0 0 0
5 13 0 0 0
6 2 0 1 0
7 20,21,22 0 0 0
8 20,21,22,23 0 0 0
9 20,21,22 0 0 0
10 5,20,21,22,27 0 0 0
11 1,2,20,21,22 1 1 0
12 5,18,20,21,22,23,26 0 0 0
13 20,21 0 0 0
14 21 0 0 0
15 20,21 0 0 0
16 20,21,26 0 0 0
17 2 0 1 0
18 1,2 1 1 0
19 2 0 1 0
20 3,4 0 0 1
The expected output is the columns (dog, cat, repamp) above... these were easy to do as there is only one digit; what I'm having trouble with is splitting up multiples.
A tidyverse solution could be employed with mutate() and if_else() from the dplyr library, and grepl(), for example:
animals <- animals %>%
  mutate(dogs = if_else(grepl("\\b1\\b|\\b22\\b", ANI_type), 1, 0),
         cats = if_else(grepl("\\b2\\b|\\b31\\b", ANI_type), 1, 0))
In this case, you'd want to separate all the different potential codes for each animal with |, which acts as an OR (alternation) inside the regular expression.
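For the group mentioned in the question (codes 6 to 11 going into animals$pock), the alternation can be built from a vector of codes rather than typed by hand. This is a sketch in base R on a few made-up rows; pock_codes and pock_pattern are illustrative names, and perl = TRUE is used here only as one way to get well-defined \b word boundaries.

```r
# Toy rows (hypothetical) covering matches and near-misses such as 16 and 26
animals <- data.frame(
  ANI_type = c("1,2,5,12", "6,20", "16,26", "10,11", "3,4"),
  stringsAsFactors = FALSE
)

# Codes that belong to the same category
pock_codes <- 6:11

# Builds "\\b(6|7|8|9|10|11)\\b": any one of the codes as a whole number
pock_pattern <- paste0("\\b(", paste(pock_codes, collapse = "|"), ")\\b")

# 1 if the respondent listed at least one of the codes, 0 otherwise
animals$pock <- as.numeric(grepl(pock_pattern, animals$ANI_type, perl = TRUE))

animals
```

The word boundaries keep 16 and 26 from being counted as 6.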

How to use column numbers in the dplyr filter function

How do I use the dplyr::filter() function with column numbers, rather than column names?
As an example, I'd like to pick externally selected columns and return the rows that are all zeros. For example, for a data frame like this
> test
# A tibble: 10 x 4
C001 C007 C008 C020
<dbl> <dbl> <dbl> <dbl>
1 -1 -1 0 0
2 0 0 0 0
3 1 1 0 0
4 -1 -1 0 0
5 0 0 0 -1
6 0 0 0 1
7 0 1 1 0
8 0 0 -1 -1
9 1 1 0 0
10 0 0 0 0
and a vector S = c(1,3,4), how would I pick all the rows in test where all(x==0) holds for the selected columns? I can do this with test[apply(test[, S], 1, function(x) all(x == 0)), ], but I'd like to use this as part of a %>% pipeline.
I have not been able to figure out the filter() syntax to use column numbers rather than names. The real data has many more columns (>100) and rows and the column numbers are supplied by an external algorithm.
Use filter_at with all_vars:
library(dplyr)
test %>% filter_at(c(1,3,4), all_vars(. == 0))
C001 C007 C008 C020
1 0 0 0 0
2 0 0 0 0
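filter_at() is superseded in recent dplyr releases. A sketch of the same filter using if_all() (available from dplyr 1.0.4, which is an assumption about the installed version) could look like this, with the example data from the question rebuilt as a tibble and all_of() used to pass the externally supplied column positions:

```r
library(dplyr)

# Example data from the question
test <- tibble(
  C001 = c(-1, 0, 1, -1, 0, 0, 0, 0, 1, 0),
  C007 = c(-1, 0, 1, -1, 0, 0, 1, 0, 1, 0),
  C008 = c(0, 0, 0, 0, 0, 0, 1, -1, 0, 0),
  C020 = c(0, 0, 0, 0, -1, 1, 0, -1, 0, 0)
)

# Column positions supplied from elsewhere
S <- c(1, 3, 4)

# Keep rows where every selected column equals zero
zero_rows <- test %>%
  filter(if_all(all_of(S), ~ .x == 0))

zero_rows
```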

Using dplyr to gather specific dummy variables

This question is an extension of "Using dplyr to gather dummy variables".
The question: How can I gather only a few columns instead of the whole dataset? In this example, I want to gather all the columns except "sedan". My real data set has 250 columns, so it would be great if I could include or exclude the columns by name.
Data set
head(type)
x convertible coupe hatchback sedan wagon
1 0 0 0 1 0
2 0 1 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0
Output
TypeOfCar
1 x
2 coupe
3 convertible
4 convertible
5 convertible
6 convertible
Not sure if I'm understanding you, but you can do what you want with:
df %>% select(-sedan) %>% gather(Key, Value)
And if you have too many variables, you can use:
select(-contains(""))
select(-starts_with(""))
select(-ends_with(""))
Hope it helps.
You can use -sedan in gather:
dat %>% gather(TypeOfCar, Count, -sedan) %>% filter(Count >= 1) %>% select(TypeOfCar)
# TypeOfCar
# 1 convertible
# 2 convertible
# 3 convertible
# 4 convertible
# 5 coupe
Data:
tt <- "convertible coupe hatchback sedan wagon
1 0 0 0 1 0
2 0 1 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0"
dat <- read.table(text = tt, header = TRUE)
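gather() is superseded in current tidyr; the same reshaping can be sketched with pivot_longer(), assuming tidyr 1.0.0 or later is installed. The column names TypeOfCar and Count follow the answer above. Note that pivot_longer() walks the data row by row, so the rows may come out in a different order than with gather().

```r
library(dplyr)
library(tidyr)

tt <- "convertible coupe hatchback sedan wagon
1 0 0 0 1 0
2 0 1 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0"
dat <- read.table(text = tt, header = TRUE)

# Reshape every column except sedan, then keep only the dummies that are set
car_types <- dat %>%
  pivot_longer(-sedan, names_to = "TypeOfCar", values_to = "Count") %>%
  filter(Count >= 1) %>%
  select(TypeOfCar)

car_types
```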
Fixed it with a combination of the answers from @RLave and @Carlos Vecina:
right_columns <- all_data %>% select(starts_with("hour"))
all_data$all_hour <-data.frame(new_column = names(right_columns )[as.matrix(right_columns )%*%seq_along(right_columns )],stringsAsFactors=FALSE)

counting the occurrences of a number and when it occurred in R data.frame and data.table

I have newly started to learn R, so my question may be utterly ridiculous. I have a data frame
data<- data.frame('number'=1:11, 'col1'=sample(10:20),'col2'=sample(10:20),'col3'=sample(10:20),'col4'=sample(10:20),'col5'=sample(10:20), 'date'= c('12-12-2014','12-11-2014','12-10-2014','12-09-2014', '12-08-2014','12-07-2014','12-06-2014','12-05-2014','12-04-2014', '12-04-2014', '12-03-2014') )
The number column is an 'id' column and the last column is a date.
I want to count the number of times that each number occurs across (not per column, but the whole data frame containing data) the columns 2:6 and when they occurred.
I am stuck on the first part having tried the following using data.table:
count <- function(){
i = 1
DT <-data.table(data[2:6])
for (i in 10:20){
DT[, .N, by =i]
i = i + 1
}
}
which gives an error that I don't begin to understand
Error in `[.data.table`(DT, , .N, by = i) :
The items in the 'by' or 'keyby' list are length (1). Each must be same length as rows in x or number of rows returned by i (11)
Can someone help, please. Also with the second part that I have not even attempted yet i.e. associating a date or a row number with each occurrence of a number
Perhaps you may want this
library(reshape2)
table(melt(data[,-1], id.var='date')[,-2])
# value
#date 10 11 12 13 14 15 16 17 18 19 20
# 12-03-2014 0 0 1 0 0 1 0 0 1 2 0
# 12-04-2014 2 0 0 2 2 0 1 0 1 1 1
# 12-05-2014 0 0 0 0 0 0 1 1 2 0 1
# 12-06-2014 1 1 0 0 0 1 0 1 0 0 1
# 12-07-2014 0 1 0 1 0 1 1 1 0 0 0
# 12-08-2014 1 1 0 0 1 0 0 1 1 0 0
# 12-09-2014 0 0 2 0 1 2 0 0 0 0 0
# 12-10-2014 0 0 1 1 0 0 1 0 0 1 1
# 12-11-2014 0 1 1 0 0 0 1 0 0 1 1
# 12-12-2014 1 1 0 1 1 0 0 1 0 0 0
Or if you need a data.table solution (from @Arun's comments):
library(data.table)
dcast.data.table(melt(setDT(data),
id="date", measure=2:6), date ~ value)
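Because the question's data uses sample(), its counts change on every run. The idea behind both answers (stack the value columns, keep the date next to each value, then tabulate) can be sketched in base R on a small fixed data set. The toy data frame and the names value_cols, long, overall and by_date are made up for illustration.

```r
# Small fixed data set (hypothetical values) so the counts are reproducible
toy <- data.frame(
  number = 1:3,
  col1   = c(10, 11, 10),
  col2   = c(11, 10, 12),
  date   = c("12-12-2014", "12-11-2014", "12-11-2014")
)

value_cols <- c("col1", "col2")

# Stack the value columns into one vector, repeating the date for each column
long <- data.frame(
  date  = rep(toy$date, times = length(value_cols)),
  value = unlist(toy[value_cols], use.names = FALSE)
)

# How often each number occurs across all value columns
overall <- table(long$value)

# How often each number occurs on each date
by_date <- table(long$date, long$value)

overall
by_date
```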
