How to merge columns in R with different levels of values - r

I have been given a dataset that I am attempting to perform logistic regression on. However, to do so, I need to merge some columns in R.
For instance, in the carevaluations data set, I am given the columns (BuyingPrice_low, BuyingPrice_medium, BuyingPrice_high, BuyingPrice_vhigh, MaintenancePrice_low, MaintenancePrice_medium, MaintenancePrice_high, MaintenancePrice_vhigh).
How would I combine the BuyingPrice_low, BuyingPrice_medium, etc. columns into one column called "BuyingPrice" that keeps the level order and each row's data, and do the same for the MaintenancePrice columns?

library(dplyr)
library(tidyr) # gather() comes from tidyr, not dplyr
df <- data.frame(Buy_low=rep(c(0,1), 10),
Buy_high=rep(c(0,1), 10))
one_column <- df %>%
gather(var, value)
head(one_column)
var value
1 Buy_low 0
2 Buy_low 1
3 Buy_low 0
4 Buy_low 1
5 Buy_low 0
6 Buy_low 1

This can also be done with stack in base R:
df1 <- data.frame(a=1:3,b=4:6,c=7:9)
stack(df1)
# values ind
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
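Both answers above stack the columns into a long format. If the goal is instead one factor column per group of indicator columns (which is what a logistic regression formula would typically want), a minimal sketch could look like the following. It assumes the columns are 0/1 indicators with exactly one 1 per row in each group; the data frame `cars` and the helper `collapse_dummies` are made up for illustration and are not part of the original data set.

```r
# Hypothetical one-hot data: exactly one 1 per row within the BuyingPrice_* group
cars <- data.frame(
  BuyingPrice_low    = c(1, 0, 0, 0),
  BuyingPrice_medium = c(0, 1, 0, 0),
  BuyingPrice_high   = c(0, 0, 1, 0),
  BuyingPrice_vhigh  = c(0, 0, 0, 1)
)

# Illustrative helper (not from any package): for each row, find which
# indicator column holds the largest value and map it back to a level label.
collapse_dummies <- function(data, prefix, levels) {
  cols <- paste0(prefix, "_", levels)
  idx <- max.col(as.matrix(data[cols]), ties.method = "first")
  factor(levels[idx], levels = levels, ordered = TRUE)
}

price_levels <- c("low", "medium", "high", "vhigh")
cars$BuyingPrice <- collapse_dummies(cars, "BuyingPrice", price_levels)
```

The same call with the prefix "MaintenancePrice" would handle the second group. Rows that contain no 1 at all would silently map to the first level, so it is worth checking the row sums before relying on this.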

Related

R - How to create multiple datasets based on levels of factor in multiple columns?

I'm kinda new to R and still looking for ways to make my code more elegant. I want to create multiple datasets in a more efficient way, each based on a particular value over different columns.
This is my dataset:
df<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
B=c(4,4,2,3,4,2,1,5,2,2),
C=c(3,3,3,3,4,2,5,1,2,3),
D=c(1,2,5,5,5,4,5,5,2,3),
E=c(1,4,2,3,4,2,5,1,2,3),
dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
dummy2=c("high","low","low","low","high","high","high","low","low","high"))
And I need each column to be a factor:
df[colnames(df)] <- lapply(df[colnames(df)], factor)
Now, what I want to obtain is one dataframe called "Likert_rank_yes" that contains all the observations that in the column "dummy1" have "yes", one dataframe called "Likert_rank_no" that contains all the observations that in the column "dummy1" have "no", one dataframe called "Likert_rank_high" that contains all the observations that in the column "dummy2" have "high" and so on for all my other dummies.
I want to loop or streamline the process in some way, so that there are few commands to run to get all the datasets I need.
The first two dataframes should look something like this:
Dataframe called "Likert_rank_yes" that contains all the observations that in the column "dummy1" have "yes"
Dataframe called "Likert_rank_no" that contains all the observations that in the column "dummy1" have "no"
I have to do this with several dummies with multiple levels and would like to automate/loop the process or make it more efficient, so that I don't have to subset and rename every dataframe for each dummy level. Ideally I would also need to drop the last column in each df created (the one containing the dummy considered).
I tried splitting like below but it seems it is not possible using multiple values, I just get 4 dfs (yes AND high observations, yes AND low obs, no AND high obs etc.) like so:
Splitting with a list of columns doesn't work
list_df <- split(df[c(1:5)], list(df$dummy1,df$dummy2), sep=".")
Can you help? Thanks in advance!
You need two lapplys:
vals <- colnames(df)[1:5]
dummies <- colnames(df)[-(1:5)]
step1 <- lapply(dummies, function(x) df[, c(vals, x)])
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
step2
# $dummy1
# $dummy1$no
# A B C D E dummy1
# 3 2 2 3 5 2 no
# 4 3 3 3 5 3 no
# 5 4 4 4 5 4 no
# 6 5 2 2 4 2 no
# 8 1 5 1 5 1 no
#
# $dummy1$yes
# A B C D E dummy1
# 1 1 4 3 1 1 yes
# 2 2 4 3 2 4 yes
# 7 1 1 5 5 5 yes
# 9 2 2 2 2 2 yes
# 10 3 2 3 3 3 yes
#
#
# $dummy2
# $dummy2$high
# A B C D E dummy2
# 1 1 4 3 1 1 high
# 5 4 4 4 5 4 high
# 6 5 2 2 4 2 high
# 7 1 1 5 5 5 high
# 10 3 2 3 3 3 high
#
# $dummy2$low
# A B C D E dummy2
# 2 2 4 3 2 4 low
# 3 2 2 3 5 2 low
# 4 3 3 3 5 3 low
# 8 1 5 1 5 1 low
# 9 2 2 2 2 2 low
For the first data set ("dummy1" and "no") use step2$dummy1$no or step2[[1]][[1]] or step2[["dummy1"]][["no"]].
For programming purposes it is usually better to keep the list intact since it makes it simple to write code that processes all of the data frames in the list without having to specify them individually.
You are very close:
tbls <- unlist(step2, recursive=FALSE)
list2env(tbls, envir=.GlobalEnv)
ls()
# [1] "df" "dummies" "dummy1.no" "dummy1.yes" "dummy2.high" "dummy2.low" "step1" "step2" "tbls" "vals"
This will create the same set of tables.
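The question also asked for names such as "Likert_rank_yes" and for the dummy column to be dropped from each piece. A possible sketch of that variation, assuming the same `df` as in the question; the object names `pieces` and `likert` are made up here:

```r
df <- data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
                 B=c(4,4,2,3,4,2,1,5,2,2),
                 C=c(3,3,3,3,4,2,5,1,2,3),
                 D=c(1,2,5,5,5,4,5,5,2,3),
                 E=c(1,4,2,3,4,2,5,1,2,3),
                 dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
                 dummy2=c("high","low","low","low","high","high","high","low","low","high"))

vals <- colnames(df)[1:5]
dummies <- colnames(df)[-(1:5)]

# Split only the value columns, so the dummy column is dropped automatically;
# each dummy supplies the grouping but is not part of the result.
pieces <- lapply(dummies, function(d) split(df[vals], df[[d]]))

# Flatten the list of lists into one list and add the requested prefix
likert <- do.call(c, pieces)
names(likert) <- paste0("Likert_rank_", names(likert))
```

Individual tables are then available as, for example, likert$Likert_rank_yes. This naming scheme assumes no two dummies share a level label; if they do, including the dummy name in the prefix would avoid clashes.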

How to rearrange columns of a data frame based on values in a row

This is an R programming question. I would like to rearrange the order of columns in a data frame based on the values in one of the rows. Here is an example data frame:
df <- data.frame(A=c(1,2,3,4),B=c(3,2,4,1),C=c(2,1,4,3),
D=c(4,2,3,1),E=c(4,3,2,1))
Suppose I want to rearrange the columns in df based on the values in row 4, ascending from 1 to 4, with ties having the same rank. So the desired data frame could be:
df <- data.frame(B=c(3,2,4,1),D=c(4,2,3,1),E=c(4,3,2,1),
C=c(2,1,4,3),A=c(1,2,3,4))
although I am indifferent about the order of the first three columns, all of which have the value 1 in row 4.
I could do this with a for loop, but I am looking for a simpler approach. Thank you.
We can use select: take row 4, unlist it into a vector, order the values, and pass the resulting column positions to select.
library(dplyr)
df %>%
select(order(unlist(.[4, ])))
-output
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
Or, with slice_tail and purrr's flatten_dbl (this needs library(purrr) loaded as well):
df %>%
select({.} %>%
slice_tail(n = 1) %>%
flatten_dbl %>%
order)
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
or in base R (note the comma: the ordering must index columns, not rows)
df[, order(unlist(tail(df, 1)))]
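To see what the base R approach does step by step, here is a small sketch on the example data from the question; the intermediate names `key`, `ord` and `reordered` are only there for illustration:

```r
df <- data.frame(A=c(1,2,3,4), B=c(3,2,4,1), C=c(2,1,4,3),
                 D=c(4,2,3,1), E=c(4,3,2,1))

# Row 4 as a named numeric vector: A=4, B=1, C=3, D=1, E=1
key <- unlist(df[4, ])

# Column positions sorted by those values; ties keep their original order
ord <- order(key)

# Index the columns (not the rows) with the ordering
reordered <- df[, ord]
names(reordered)
# [1] "B" "D" "E" "C" "A"
```

Using df[4, ] instead of tail(df, 1) makes the row explicit, which matters if the row of interest is not the last one.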

Combining two columns using shared values in first column

I am trying to adjust the formatting of a data set. My current set looks like this, in two columns. The first column is a "cluster" and the second column "name" contains values within each cluster:
Cluster Name
A 1
A 2
A 3
B 4
B 5
C 2
C 6
C 7
And I'd like a single column in which all the values from column 2 are listed under their associated cluster from column 1:
Cluster A
1
2
3
Cluster B
4
5
Cluster C
2
6
7
I've been trying in R and Excel with no luck for the last few hours. Any ideas?
Using a trick with tidyr::nest :
library(dplyr)
library(tidyr)
df %>% mutate(Cluster = paste0("Cluster_",Cluster)) %>% nest(Name) %>% t %>% unlist %>% as.data.frame
# .
# 1 Cluster_A
# 2 1
# 3 2
# 4 3
# 5 Cluster_B
# 6 4
# 7 5
# 8 Cluster_C
# 9 2
# 10 6
# 11 7
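The nest-based trick depends on how a particular tidyr version handles nest(Name), so a base R sketch of the same idea may be easier to reason about. It assumes `df` has the two columns shown in the question; the names `groups`, `out` and `result` are made up for illustration:

```r
df <- data.frame(Cluster = c("A", "A", "A", "B", "B", "C", "C", "C"),
                 Name    = c(1, 2, 3, 4, 5, 2, 6, 7))

# One vector of names per cluster
groups <- split(df$Name, df$Cluster)

# Put a header line in front of each group, then flatten everything into
# one character vector (the numbers are coerced to text along the way)
out <- unlist(lapply(names(groups), function(cl) {
  c(paste("Cluster", cl), groups[[cl]])
}))

result <- data.frame(value = out)
```

Because headers and values share one column, the column is character; that is fine for display or export, but the numbers would need converting back before any computation.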

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data that is within 2 SD of the mean for each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the line of code that keeps only the data within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (using an if/else condition) based on the mean and sd. Note that this returns a list whose elements can have different lengths, not a data frame.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we need to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any numeric column, we can loop through the columns with lapply as before; instead of returning 'x', we return the positions of the elements of 'x' that pass the check, and then take the intersection of the list elements with Reduce. The resulting numeric index can be used to subset the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: the first one indicates the numeric columns, and the second one checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
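The scale-based filter can be wrapped in a small reusable function. This is only a sketch: the name `keep_within_sd` and its `k` argument are made up here, and it assumes the numeric columns contain no missing values (an NA would make the row test return NA).

```r
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)

# Illustrative helper: keep rows where every numeric column lies within
# k standard deviations of that column's mean
keep_within_sd <- function(data, k = 2) {
  num <- vapply(data, is.numeric, logical(1))  # which columns are numeric
  z <- abs(scale(data[num]))                   # absolute z-scores per column
  data[rowSums(z > k) == 0, , drop = FALSE]    # rows with no value beyond k SD
}

filtered <- keep_within_sd(df)
```

On the toy data this drops only the first row, whose birds value of 21 is a little over 2 SD above the column mean.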

How to combine two data frames using dplyr or other packages?

I have two data frames:
df1 = data.frame(index=c(0,3,4),n1=c(1,2,3))
df1
# index n1
# 1 0 1
# 2 3 2
# 3 4 3
df2 = data.frame(index=c(1,2,3),n2=c(4,5,6))
df2
# index n2
# 1 1 4
# 2 2 5
# 3 3 6
I want to join these to:
index n
1 0 1
2 1 4
3 2 5
4 3 8 (index 3 appears in both data frames, so add 2 and 6)
5 4 3
6 5 0 (index 5 is in neither data frame, so set it to 0)
7 6 0 (index 6 is in neither data frame, so set it to 0)
The given data frames are just part of large dataset. Can I do it using dplyr or other packages in R?
Using data.table (which would be efficient for bigger datasets). I am not changing the column names, because rbindlist takes the names from the first dataset in the list, i.e. in this case n for the second column (I don't know whether that is a feature or a bug). Once the datasets are bound with rbindlist, group by the index column (by=index) and sum the n column (list(n=sum(n))).
library(data.table)
rbindlist(list(data.frame(index=0:6,n=0), df1,df2))[,list(n=sum(n)), by=index]
index n
#1: 0 1
#2: 1 4
#3: 2 5
#4: 3 8
#5: 4 3
#6: 5 0
#7: 6 0
Or using dplyr. Here, the column names of all the datasets should be the same, so I am changing them before binding the datasets with bind_rows (rbind_list, used in older dplyr versions, has been deprecated in its favor). If the names are different, there will be a separate column for each name. After binding the datasets, group by index and then use summarise to sum column n.
library(dplyr)
nm1 <- c("index", "n")
colnames(df1) <- colnames(df2) <- nm1
bind_rows(df1, df2, data.frame(index=0:6, n=0)) %>%
group_by(index) %>%
summarise(n=sum(n))
This is something you could do with the base functions aggregate and rbind
df1 = data.frame(index=c(0,3,4),n=c(1,2,3))
df2 = data.frame(index=c(1,2,3),n=c(4,5,6))
aggregate(n~index, rbind(df1, df2, data.frame(index=0:6, n=0)), sum)
which returns
index n
1 0 1
2 1 4
3 2 5
4 3 8
5 4 3
6 5 0
7 6 0
How about
names(df1) <- c("index", "n") # set colnames of df1 to target
df3 <- rbind(df1, setNames(df2, names(df1))) # set colnames of df2 and stack the rows
df <- df3 %>% dplyr::arrange(index) # sort by index (needs dplyr loaded for %>%; duplicated indices are not summed)
Cheers.
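Stacking and sorting alone leaves index 3 duplicated and indices 5 and 6 missing. A possible way to finish the job in base R is sketched below; it assumes the full index range 0:6 is known in advance, and the names `stacked`, `all_index`, `totals` and `result` are made up for illustration:

```r
df1 <- data.frame(index = c(0, 3, 4), n1 = c(1, 2, 3))
df2 <- data.frame(index = c(1, 2, 3), n2 = c(4, 5, 6))

# Give both data frames the same column names and stack them
stacked <- rbind(setNames(df1, c("index", "n")),
                 setNames(df2, c("index", "n")))

# Sum n per index over the full range; indices with no rows come back as NA
all_index <- 0:6
totals <- tapply(stacked$n, factor(stacked$index, levels = all_index), sum)
totals[is.na(totals)] <- 0

result <- data.frame(index = all_index, n = as.vector(totals))
```

Turning the index into a factor with explicit levels is what makes the absent indices show up at all, so the zero-fill does not need a separate padding data frame.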
