Using dplyr to gather specific dummy variables - r

This question is an extension of "Using dplyr to gather dummy variables".
The question: how can I gather only some of the columns instead of the whole data set? In this example, I want to gather all the columns except "sedan". My real data set has 250 columns, so it would be great if I could include or exclude columns by name.
Data set
head(type)
x convertible coupe hatchback sedan wagon
1 0 0 0 1 0
2 0 1 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0
Output
TypeOfCar
1 x
2 coupe
3 convertible
4 convertible
5 convertible
6 convertible

I'm not sure I understand you correctly, but you can do what you want with:
df %>% select(-sedan) %>% gather(Key, Value)
And if you have too many variables to list by hand, you can use the select helpers:
select(-contains(""))
select(-starts_with(""))
select(-ends_with(""))
Hope it helps.
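As a rough illustration of the helper-based approach, here is a minimal sketch on a made-up data frame. The data and the column name sedan_wagon are assumptions for the example only; the point is that every column whose name starts with "sedan" is dropped before gathering.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; sedan_wagon is invented to show the prefix match
df <- data.frame(
  convertible = c(0, 0, 1),
  coupe       = c(0, 1, 0),
  sedan       = c(1, 0, 0),
  sedan_wagon = c(0, 0, 0)
)

# Drop every column whose name starts with "sedan", then gather the rest
long <- df %>%
  select(-starts_with("sedan")) %>%
  gather(Key, Value)

unique(long$Key)
```

The same pattern should work with contains() or ends_with() when the columns to exclude share a substring or suffix instead of a prefix.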

You can use -sedan in gather:
dat %>% gather(TypeOfCar, Count, -sedan) %>% filter(Count >= 1) %>% select(TypeOfCar)
# TypeOfCar
# 1 convertible
# 2 convertible
# 3 convertible
# 4 convertible
# 5 coupe
Data:
tt <- "convertible coupe hatchback sedan wagon
1 0 0 0 1 0
2 0 1 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0"
dat <- read.table(text = tt, header = T)
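Since gather() has been superseded by pivot_longer() in tidyr 1.0.0 and later, the same idea can be sketched with the newer function. This is only an equivalent under that assumption, not the answer's original code; note that pivot_longer() walks the data row by row, so the order of the result differs from the gather() output.

```r
library(dplyr)
library(tidyr)

dat <- read.table(text = "convertible coupe hatchback sedan wagon
1 0 0 0 1 0
2 0 1 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0", header = TRUE)

# Pivot every column except sedan, keep only the cells that hold a 1
res <- dat %>%
  pivot_longer(cols = -sedan, names_to = "TypeOfCar", values_to = "Count") %>%
  filter(Count >= 1) %>%
  select(TypeOfCar)

res
```

The first row of the data has its 1 in the excluded sedan column, so it contributes no row to the result.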

Fixed it with a combination of the answers from @RLave and @Carlos Vecina:
right_columns <- all_data %>% select(starts_with("hour"))
# The matrix product gives, for each row, the position of the column holding the 1
# (this assumes exactly one 1 per row)
all_data$all_hour <- names(right_columns)[as.matrix(right_columns) %*% seq_along(right_columns)]
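To see why the matrix product recovers the column name, here is a small sketch on invented data. The data frame and its hour_* columns are assumptions for illustration; multiplying the 0/1 matrix by the vector 1, 2, 3, ... leaves, for each row, the position of its single 1.

```r
# Hypothetical data: exactly one dummy column is 1 in each row
all_data <- data.frame(
  id     = 1:3,
  hour_1 = c(0, 1, 0),
  hour_2 = c(1, 0, 0),
  hour_3 = c(0, 0, 1)
)

# Base R equivalent of select(starts_with("hour"))
right_columns <- all_data[startsWith(names(all_data), "hour")]

# Row i of the product is sum(row_i * c(1, 2, 3)), i.e. the index of the 1
idx <- as.vector(as.matrix(right_columns) %*% seq_along(right_columns))

all_data$all_hour <- names(right_columns)[idx]
all_data
```

If a row could contain no 1 at all, or more than one, the index would be 0 or a sum of positions, so the trick only holds for strictly one-hot rows.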

Related

R rowSums and select inside mutate function dplyr

I am trying to select columns in my dataframe and add them together if they exist. I tried
newdf4 %>% select(any_of(contains(c("adSize_300 x 600","adSize_160 x 600","adSize_120 x 600","adSize_125 x 600")))) %>% mutate(vertical_sizes=rowSums(.))
It gives me a separate output of the two columns that exist and the new vertical_sizes column created:
adSize_300 x 600 adSize_160 x 600 vertical_sizes
1 1 0 1
2 0 0 0
3 0 0 0
4 0 0 0
5 0 1 1
6 0 0 0
7 1 0 1
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
However, I want the new vertical sizes column to be added to my original newdf4 dataframe.
I am trying:
newdf4 %>% mutate(vertical_sizes = rowSums(select(any_of(contains(c("adSize_300 x 600","adSize_160 x 600","adSize_120 x 600","adSize_125 x 600"))))))
But I receive this error:
Error in `mutate_cols()`:
! Problem with `mutate()` column `vertical_sizes`.
i `vertical_sizes = rowSums(...)`.
x `any_of()` must be used within a *selecting* function.
i See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>.
Caused by error:
! `any_of()` must be used within a *selecting* function.
i See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>.
Run `rlang::last_error()` to see where the error occurred.
Please let me know if there is another approach to this. Thank you!
select() inside of mutate() requires the data as its first argument; it will not infer or assume that it should look in the enclosing environment. For instance, when one does something like
mtcars %>%
select(cyl, mpg, disp)
The %>% is placing the data in the first argument of the RHS function, so this is more-aptly shown as
mtcars %>%
select(., cyl, mpg, disp)
where . indicates where the data is going. %>% allows one to specify where in the RHS expression the data should be placed (generally it works so long as the . is not placed within nested expressions).
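Having said that, your use of select places the any_of(...) call in the first argument, where the data should be, which is not going to work.
I suggest using cur_data() (one of dplyr's context-dependent expressions) as the first argument. Note that cur_data() is deprecated as of dplyr 1.1.0 in favor of pick(), so on recent versions rowSums(pick(contains(...))) is the equivalent form.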
Data:
newdf4 <- read.table(header = TRUE, text = '
adSize_300x600 adSize_160x600 vertical_sizes
1 1 0 1
2 0 0 0
3 0 0 0
4 0 0 0
5 0 1 1
6 0 0 0
7 1 0 1
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0')
Code:
newdf4 %>%
  mutate(
    vertical_sizes = rowSums(
      select(cur_data(),
             contains(c("adSize_300x600", "adSize_160x600", "adSize_120x600", "adSize_125x600")))
    )
  )
# adSize_300x600 adSize_160x600 vertical_sizes
# 1 1 0 1
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 0 1 1
# 6 0 0 0
# 7 1 0 1
# 8 0 0 0
# 9 0 0 0
# 10 0 0 0
# 11 0 0 0
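For exact column names that may or may not exist, any_of() with the plain name vector is probably closer to the asker's intent than contains(). The sketch below uses invented data (the clicks column and the values are assumptions) and two variants: a base R one with intersect(), and a dplyr one that passes the piped data explicitly as `.`, which avoids depending on cur_data() or pick().

```r
library(dplyr)

# Hypothetical data: only two of the four wanted columns exist
newdf4 <- data.frame(
  "adSize_300 x 600" = c(1, 0, 0, 1),
  "adSize_160 x 600" = c(0, 0, 1, 0),
  clicks = c(5, 6, 7, 8),
  check.names = FALSE
)
wanted <- c("adSize_300 x 600", "adSize_160 x 600",
            "adSize_120 x 600", "adSize_125 x 600")

# Base R: keep only the wanted names that are present, then sum by row
present  <- intersect(wanted, names(newdf4))
base_sum <- rowSums(newdf4[present])

# dplyr: `.` is the piped data frame, any_of() silently skips missing names
res <- newdf4 %>%
  mutate(vertical_sizes = rowSums(select(., any_of(wanted))))

res
```

Both variants keep all original columns and add the row sum; the `.` form ignores grouping, so on a grouped data frame a context-aware helper would be the safer choice.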

Stacking multiple columns in R

I am trying to convert a data frame into long form in R.
This is example data for surveys conducted in 'id' grids over 9 days, recording whether the variable of interest was detected ('1') or not detected ('0').
I want to convert this data frame so that the number of surveys is reduced from 9 to 3, with each survey period now containing 3 visits.
I am trying to do so by stacking three columns at a time, so that survey visits 'v1' to 'v9' get converted to v1, v2, v3, with an added column called 'visit_no' that describes the visit number within the survey period.
Code to generate the data:
id<- c(240,220,160)
v1<- c(rep(0,9))
v2<-c(rep(0,3),1,rep(0,5))
v3<- c(1,rep(0,8))
v<-as.data.frame(rbind(v1,v2,v3))
survey<- cbind(id,v)
survey
One way is to use reshape in base R:
reshape(survey, direction="long", idvar="id",
varying=list(c("V1","V4","V7"), c("V2","V5","V8"), c("V3","V6","V9")),
v.names=c("Visit1", "Visit2", "Visit3"), timevar="visit_no")
id visit_no Visit1 Visit2 Visit3
240.1 240 1 0 0 0
220.1 220 1 0 0 0
160.1 160 1 1 0 0
240.2 240 2 0 0 0
220.2 220 2 1 0 0
160.2 160 2 0 0 0
240.3 240 3 0 0 0
220.3 220 3 0 0 0
160.3 160 3 0 0 0
If you want it sorted by id, then add arrange from dplyr
%>% dplyr::arrange(id)
id visit_no Visit1 Visit2 Visit3
1 160 1 1 0 0
2 160 2 0 0 0
3 160 3 0 0 0
4 220 1 0 0 0
5 220 2 1 0 0
6 220 3 0 0 0
7 240 1 0 0 0
8 240 2 0 0 0
9 240 3 0 0 0
If your original variable names were in a consistent format, then the reshape command is even simpler because it will correctly guess the times from the names. For example,
names(survey)[2:10] <- paste0(names(survey)[2:10], ".", rep(1:3, 3))
head(survey)
id V1.1 V2.2 V3.3 V4.1 V5.2 V6.3 V7.1 V8.2 V9.3
v1 240 0 0 0 0 0 0 0 0 0
v2 220 0 0 0 1 0 0 0 0 0
v3 160 1 0 0 0 0 0 0 0 0
reshape(survey, direction="long", idvar="id",
varying=2:10, # Can just give the indices now.
v.names=c("Visit1", "Visit2", "Visit3"), timevar="visit_no") %>%
arrange(id)
Although the times are in a consistent format, the original variable names are not, so R cannot guess the names for the long format (Visit1, Visit2, Visit3), and these need to be supplied in the v.names argument.
If they were in a consistent format, then the reshape is even simpler.
names(survey)[2:10] <- paste0("Visit", rep(1:3, each=3), ".", rep(1:3, 3))
head(survey)
id Visit1.1 Visit1.2 Visit1.3 Visit2.1 Visit2.2 Visit2.3 Visit3.1 Visit3.2 Visit3.3
v1 240 0 0 0 0 0 0 0 0 0
v2 220 0 0 0 1 0 0 0 0 0
v3 160 1 0 0 0 0 0 0 0 0
reshape(survey, direction="long", varying=2:10, timevar="visit_no") %>%
arrange(id)
The tidyr version would probably involve two reshapes; one to get everything in very long form, and again to get it back to a wider form (what I call the 1 step back, 2 steps forward method).
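A minimal sketch of that two-step tidyr route, assuming the original V1 to V9 column names and assuming that consecutive triples of columns form one survey period (the same grouping as the Visit1.1, Visit1.2, ... renaming). The helper columns k and survey_no are names invented for this example.

```r
library(dplyr)
library(tidyr)

# Rebuild the question's data (columns come out as V1 ... V9)
id <- c(240, 220, 160)
v  <- as.data.frame(rbind(v1 = rep(0, 9),
                          v2 = c(rep(0, 3), 1, rep(0, 5)),
                          v3 = c(1, rep(0, 8))))
survey <- cbind(id, v)

res <- survey %>%
  # Step 1: very long form, one row per id and original column
  pivot_longer(cols = -id, names_to = "col", values_to = "value") %>%
  mutate(k         = as.integer(sub("V", "", col)),        # 1..9
         survey_no = paste0("Visit", (k - 1) %/% 3 + 1),   # Visit1..Visit3
         visit_no  = (k - 1) %% 3 + 1) %>%                  # 1..3 within a survey
  select(id, visit_no, survey_no, value) %>%
  # Step 2: back to a wider form, one column per survey period
  pivot_wider(names_from = survey_no, values_from = value)

res
```

Changing the two integer-arithmetic lines is enough to switch to a different grouping of the nine columns.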
You can change the name of the columns based on the sequence that you want.
names(survey)[-1] <- paste(rep(paste0("visit", 1:3), each =3), 1:3, sep = "_")
names(survey)
#[1] "id" "visit1_1" "visit1_2" "visit1_3" "visit2_1" "visit2_2" "visit2_3"
# "visit3_1" "visit3_2" "visit3_3"
And then use pivot_longer from tidyr to get data in different columns.
tidyr::pivot_longer(survey, cols = -id, names_to = c(".value", "visit_no"),
names_sep = "_") %>%
type.convert(as.is = TRUE)
# A tibble: 9 x 5
# id visit_no visit1 visit2 visit3
# <int> <int> <int> <int> <int>
#1 240 1 0 0 0
#2 240 2 0 0 0
#3 240 3 0 0 0
#4 220 1 0 1 0
#5 220 2 0 0 0
#6 220 3 0 0 0
#7 160 1 1 0 0
#8 160 2 0 0 0
#9 160 3 0 0 0

Calling groups of string digits using grepl

I'm using the grepl function to try and sort through data; all the row numbers are different survey respondents, and each number in the "ANI_type" string represents a different type of animal - I need to sort these depending on animal type. More specifically, I need to group some of the digits in the strings into categories. For example, digits 6,7,8,9,10,11 all need to be placed in the animals$pock object. How would I go about that using the grep function?
> animals$dogs <- as.numeric(grepl("\\b1\\b", animals$ANI_type))
> animals
ANI_type dogs cats repamp
1 1,2,5,12,13,14,15,16,18,19,27 1 1 0
2 2 0 1 0
3 20,21,22,23,26 0 0 0
4 20,21,22,23 0 0 0
5 13 0 0 0
6 2 0 1 0
7 20,21,22 0 0 0
8 20,21,22,23 0 0 0
9 20,21,22 0 0 0
10 5,20,21,22,27 0 0 0
11 1,2,20,21,22 1 1 0
12 5,18,20,21,22,23,26 0 0 0
13 20,21 0 0 0
14 21 0 0 0
15 20,21 0 0 0
16 20,21,26 0 0 0
17 2 0 1 0
18 1,2 1 1 0
19 2 0 1 0
20 3,4 0 0 1
The expected output is the columns (dog, cat, repamp) above... these were easy to do as there is only one digit; what I'm having trouble with is splitting up multiples.
A tidyverse solution could use mutate() and if_else() from the dplyr library together with grepl(), for example:
animals <- animals %>%
  mutate(dogs = if_else(grepl("\\b1\\b|\\b22\\b", ANI_type), 1, 0),
         cats = if_else(grepl("\\b2\\b|\\b31\\b", ANI_type), 1, 0))
In this case, you'd separate all the potential codes for each animal with the | character, which acts as an OR (alternation) operator inside a regular expression.
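For the group of codes 6 to 11 from the question, the alternation can be built from a vector instead of typed by hand. This is a sketch on a few invented ANI_type strings; pock_codes and pattern are names made up for the example.

```r
# Hypothetical sample of ANI_type strings
animals <- data.frame(
  ANI_type = c("1,2,5,12", "6,20", "10", "16,21", "3,4,11"),
  stringsAsFactors = FALSE
)

pock_codes <- 6:11

# Builds "\\b(6|7|8|9|10|11)\\b"; the word boundaries stop 6 matching inside 16
pattern <- paste0("\\b(", paste(pock_codes, collapse = "|"), ")\\b")

animals$pock <- as.numeric(grepl(pattern, animals$ANI_type))
animals
```

Grouping the alternatives in parentheses lets a single pair of \\b anchors apply to every code, which keeps the pattern short when a category has many codes.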

A problem for table function after filtering rows

I need to create a cross-tabulation of restaurant types against cluster. I chose three types of restaurant: Indian, Korean, and Pizza.
However, there are many other types of restaurant, so I use filter() to select the ones I want.
rest_3cui <- rest1 %>% filter(Cuisine %in% c('Indian', 'Korean', 'Pizza'))
table(rest_3cui[, c('Cuisine', 'cluster')])
However, the result looks like this:
cluster
Cuisine 1 2 3
Asian 0 0 0
Bakery 0 0 0
Burger 0 0 0
Cafe Food 0 0 0
Charcoal Chicken 0 0 0
Chinese 0 0 0
Coffee and Tea 0 0 0
Desserts 0 0 0
Dumplings 0 0 0
European 0 0 0
Fast Food 0 0 0
Fish and Chips 0 0 0
French 0 0 0
Greek 0 0 0
Hot Pot 0 0 0
Indian 4 0 1
Italian 0 0 0
Japanese 0 0 0
Korean 2 2 2
Korean BBQ 0 0 0
Lebanese 0 0 0
Malaysian 0 0 0
Mediterranean 0 0 0
Mexican 0 0 0
Modern Australian 0 0 0
Pizza 1 0 5
Portuguese 0 0 0
Pub Food 0 0 0
Ramen 0 0 0
Russian 0 0 0
Sandwich 0 0 0
Sushi 0 0 0
Thai 0 0 0
Tibetan 0 0 0
Vietnamese 0 0 0
which contains many useless rows for the other cuisines instead of only my three targets.
My intended table should be:
cluster
Cuisine 1 2 3
Indian 4 0 1
Korean 2 2 2
Pizza 1 0 5
Hope someone can help.
That is because Cuisine is of type factor: filtering rows does not remove the unused levels, and table() counts every level. Consider this example using the mtcars data
df <- mtcars
df$cyl <- factor(df$cyl)
table(df[, c("cyl", "am")])
# am
#cyl 0 1
# 4 3 8
# 6 4 3
# 8 12 2
Now we remove one level
df2 <- subset(df, cyl != 4)
table(df2[, c("cyl", "am")])
# am
#cyl 0 1
# 4 0 0
# 6 4 3
# 8 12 2
We still have a row where cyl = 4. We need to drop that level. One way is to convert it to character
df2$cyl <- as.character(df2$cyl)
table(df2[, c("cyl", "am")])
# am
#cyl 0 1
# 6 4 3
# 8 12 2
Or, if we want to keep it as a factor, call factor again so that unused levels are dropped
df2$cyl <- factor(df2$cyl)
Or droplevels
df2$cyl <- droplevels(df2$cyl)
In your case, you can add droplevels after the filter and it should work fine.
library(dplyr)
rest_3cui <- rest1 %>%
  filter(Cuisine %in% c('Indian', 'Korean', 'Pizza')) %>%
  droplevels()
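Because rest1 itself is not shown, here is a self-contained base R sketch of the same fix on a small invented data frame (the values are assumptions; only the column names Cuisine and cluster come from the question).

```r
# Hypothetical stand-in for rest1
rest1 <- data.frame(
  Cuisine = factor(c("Indian", "Korean", "Pizza", "Thai", "Sushi", "Pizza")),
  cluster = c(1, 2, 3, 1, 2, 3)
)
keep <- c("Indian", "Korean", "Pizza")

# Without droplevels(): Sushi and Thai still appear as all-zero rows
tab_before <- table(rest1[rest1$Cuisine %in% keep, c("Cuisine", "cluster")])

# With droplevels(): only the three kept cuisines remain
rest_3cui <- droplevels(rest1[rest1$Cuisine %in% keep, ])
tab_after <- table(rest_3cui[, c("Cuisine", "cluster")])

tab_after
```

Comparing tab_before with tab_after shows that the counts are identical and only the empty factor levels disappear.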

Get indices from each row and merge with original data.frame

I have the following data.frame
user_id 1 2 3 4 5 6 7 8 9
1 54449024717783 0 0 1 0 0 0 0 0 0
2 117592134783793 0 0 0 0 0 1 0 0 0
3 187145545782493 0 0 1 0 0 0 0 0 0
4 245003020993334 0 0 0 0 0 1 0 0 0
5 332625230637592 0 1 0 0 0 0 0 0 0
6 336336752713947 0 1 0 0 0 0 0 0 0
What I would like to do is create one column (and remove columns 1:9) that holds the name of the column where the value is 1. Each user has exactly one column with the value 1.
If I run the following function:
rowSums(users_cluster(users_cluster), dims = 1)
it sums the values in each row, but what I need is the column name.
Base R solution:
data.frame(user_id = df[, 1],
           # subtract 1 before the modulo so a 1 in the last column maps to ncol(df) - 1, not 0
           name = (which(t(df[, -1] == 1)) - 1) %% (ncol(df) - 1) + 1)
# user_id name
# 1 54449024717783 3
# 2 117592134783793 6
# 3 187145545782493 3
# 4 245003020993334 6
# 5 332625230637592 2
# 6 336336752713947 2
Here's another base R option:
inds <- which(df[,-1]!=0,TRUE)
df$newcol <- inds[order(inds[, "row"]), "col"]  # order numerically by row index
df[,c(1,11)]
# user_id newcol
#1 5.444902e+13 3
#2 1.175921e+14 6
#3 1.871455e+14 3
#4 2.450030e+14 6
#5 3.326252e+14 2
#6 3.363368e+14 2
Another approach is max.col from base R, since the user specified that each user has only one column with the value 1:
cbind(dat[1], ind = max.col(dat[-1], 'first'))
# user_id ind
#1 54449024717783 3
#2 117592134783793 6
#3 187145545782493 3
#4 245003020993334 6
#5 332625230637592 2
#6 336336752713947 2
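max.col() returns a column position, which happens to equal the column name in the sample data. If the columns had arbitrary names, the position could be mapped back to the name, as in this sketch on invented data (the user ids and values are assumptions).

```r
# Hypothetical data; check.names = FALSE keeps the numeric-looking names
dat <- data.frame(
  user_id = c("a", "b", "c"),
  "1" = c(0, 0, 1),
  "2" = c(1, 0, 0),
  "3" = c(0, 1, 0),
  check.names = FALSE
)

flags <- dat[-1]

# Position of the (first) maximum in each row, translated to a column name
dat$name <- names(flags)[max.col(flags, ties.method = "first")]

res <- dat[c("user_id", "name")]
res
```

With ties.method = "first" the result is deterministic, whereas the default "random" could pick a different column when a row has several equal maxima.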
Another base R solution:
df$ind = apply(df[,-1]>0,1,which)
df[,c("user_id","ind")]
Output:
user_id ind
1 5.444902e+13 3
2 1.175921e+14 6
3 1.871455e+14 3
4 2.450030e+14 6
5 3.326252e+14 2
6 3.363368e+14 2
A solution using the tidyverse.
library(tidyverse)
dat2 <- dat %>%
  mutate(ID = 1:n()) %>%
  gather(Column, Value, -user_id, -ID) %>%
  filter(Value == 1) %>%
  arrange(ID) %>%
  select(-Value, -ID) %>%
  as.data.frame()
dat2
# user_id Column
# 1 54449024717783 3
# 2 117592134783793 6
# 3 187145545782493 3
# 4 245003020993334 6
# 5 332625230637592 2
# 6 336336752713947 2
DATA
dat <- read.table(text = " user_id 1 2 3 4 5 6 7 8 9
1 54449024717783 0 0 1 0 0 0 0 0 0
2 117592134783793 0 0 0 0 0 1 0 0 0
3 187145545782493 0 0 1 0 0 0 0 0 0
4 245003020993334 0 0 0 0 0 1 0 0 0
5 332625230637592 0 1 0 0 0 0 0 0 0
6 336336752713947 0 1 0 0 0 0 0 0 0",
header = TRUE, stringsAsFactors = FALSE)
library(tidyverse)
dat <- as_tibble(dat) %>%
setNames(sub("X", "", names(.))) %>%
mutate(user_id = as.character(user_id))
For the sake of completeness, here is also a data.table solution which uses melt() to reshape from wide to long format:
library(data.table)
melt(setDT(DF), id = "user_id")[value == 1L][order(user_id), !"value"]
user_id variable
1: 54449024717783 3
2: 117592134783793 6
3: 187145545782493 3
4: 245003020993334 6
5: 332625230637592 2
6: 336336752713947 2
This takes advantage of the fact that the sample dataset is already sorted by ascending user_id.
In case the sample dataset has a different order which should be maintained in the final result, it is necessary to remember that order by introducing a temporary row id:
melt(setDT(DF), id = "user_id")[, rn := rowid(variable)][value == 1L][
order(rn), !c("rn", "value")]
or, alternatively,
melt(setDT(DF), id = "user_id")[, rn := rowid(variable)][, setorder(.SD, rn)][
value == 1L, !c("rn", "value")]
Data
library(data.table)
DF <- fread(
"i user_id 1 2 3 4 5 6 7 8 9
1 54449024717783 0 0 1 0 0 0 0 0 0
2 117592134783793 0 0 0 0 0 1 0 0 0
3 187145545782493 0 0 1 0 0 0 0 0 0
4 245003020993334 0 0 0 0 0 1 0 0 0
5 332625230637592 0 1 0 0 0 0 0 0 0
6 336336752713947 0 1 0 0 0 0 0 0 0"
, drop = 1L)[, lapply(.SD, as.integer), by = user_id]
