How to rearrange columns of a data frame based on values in a row - r

This is an R programming question. I would like to rearrange the order of columns in a data frame based on the values in one of the rows. Here is an example data frame:
df <- data.frame(A=c(1,2,3,4),B=c(3,2,4,1),C=c(2,1,4,3),
D=c(4,2,3,1),E=c(4,3,2,1))
Suppose I want to rearrange the columns in df based on the values in row 4, ascending from 1 to 4, with ties having the same rank. So the desired data frame could be:
df <- data.frame(B=c(3,2,4,1),D=c(4,2,3,1),E=c(4,3,2,1),
C=c(2,1,4,3),A=c(1,2,3,4))
although I am indifferent about the order of the first three columns, all of which have the value 1 in row 4.
I could do this with a for loop, but I am looking for a simpler approach. Thank you.

We can use select: subset row 4, unlist it, order the values, and pass the resulting column positions to select
library(dplyr)
df %>%
select(order(unlist(.[4, ])))
-output
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
Or we may use the following, which also needs purrr for flatten_dbl
library(purrr)
df %>%
select({.} %>%
slice_tail(n = 1) %>%
flatten_dbl %>%
order)
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
or in base R
df[, order(unlist(tail(df, 1)))]
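A minimal base R sketch of the same idea, spelled out step by step on the example data from the question. The intermediate names (row_values, col_order, reordered) are only illustrative. Because order() keeps tied values in their original relative order, the three columns with a 1 in row 4 stay in the order B, D, E.

```r
# Example data from the question
df <- data.frame(A = c(1, 2, 3, 4), B = c(3, 2, 4, 1), C = c(2, 1, 4, 3),
                 D = c(4, 2, 3, 1), E = c(4, 3, 2, 1))

# Values of row 4 as a named vector, one element per column
row_values <- unlist(df[4, ])

# Column positions sorted by ascending row-4 value (ties keep their order)
col_order <- order(row_values)

# Index the COLUMNS (note the leading comma), not the rows
reordered <- df[, col_order]
names(reordered)
```

Indexing with df[col_order, ] instead would reorder rows, which is an easy mistake to make here.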

Related

How to compute the NAs with the column mean and then multiply columns of different lengths in R?

My question might be not so clear so I am putting an example.
My final goal is to produce
final=(df1$a*df2$b)+(df1$a*df3$c*df4$d)+(df4$d*df5$e)
I have five data frames (one column each) with different lengths as follows:
df1
a
1. 1
2. 2
3. 4
4. 2
df2
b
1. 2
2. 6
df3
c
1. 2
2. 4
3. 3
df4
d
1. 1
2. 2
3. 4
4. 3
df5
e
1. 4
2. 6
3. 2
So I want a final database which includes them all as follows
finaldf
a b c d e
1. 1 2 2 1 4
2. 2 6 4 2 6
3. 4 NA 3 4 2
4. 2 NA NA 3 NA
I want all the NAs for each column to be replaced with the mean of that column, so the finaldf has equal length of all the columns:
finaldf
a b c d e
1. 1 2 2 1 4
2. 2 6 4 2 6
3. 4 4 3 4 2
4. 2 4 3 3 4
and therefore I can produce a final result for final=(df1$a*df2$b)+(df1$a*df3$c*df4$d)+(df4$d*df5$e) as I need.
The easiest by far is to use the qpcR, dplyr and tidyr packages.
library(dplyr)
library(qpcR)
library(tidyr)
df1 <- data.frame(a=c(1,2,4,2))
df2 <- data.frame(b=c(2,6))
df3 <- data.frame(c=c(2,4,3))
df4 <- data.frame(d=c(1,2,4,3))
df5 <- data.frame(e=c(4,6,2))
mydf <- qpcR:::cbind.na(df1, df2, df3, df4,df5) %>%
tidyr::replace_na(., as.list(colMeans(., na.rm = TRUE)))
> mydf
a b c d e
1 1 2 2 1 4
2 2 6 4 2 6
3 4 4 3 4 2
4 2 4 3 3 4
Depending on your rgl settings, you might need to run the following at the top of your script to make the qpcR package load (see https://stackoverflow.com/a/66127391/2554330 ):
options(rgl.useNULL = TRUE)
library(rgl)
With purrr and dplyr, we can first put all data frames in a list with mget(). Second, use set_names to replace the data frame names with their respective column names. Third, extract each column as a vector with pluck. Then pad the vectors with NAs so that they all have the same length.
Finally, bind the vectors back into a data frame with as.data.frame, then use mutate with across and replace_na (from tidyr) to fill each column's NAs with that column's mean.
library(dplyr)
library(purrr)
library(tidyr) # replace_na() comes from tidyr
mget(ls(pattern = 'df\\d')) %>%
set_names(map_chr(., colnames)) %>%
map(pluck, 1) %>%
map(., `length<-`, max(lengths(.))) %>%
as.data.frame %>%
mutate(across(everything(), ~replace_na(.x, mean(.x, na.rm=TRUE))))
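For comparison, here is a base R sketch of the same pad-then-impute steps that also computes the final expression from the question. It assumes the five columns are available as plain vectors in a list (the list name cols is hypothetical), and it is not the method used in either answer above.

```r
# The five columns from the question, as vectors of different lengths
cols <- list(a = c(1, 2, 4, 2), b = c(2, 6), c = c(2, 4, 3),
             d = c(1, 2, 4, 3), e = c(4, 6, 2))

# Pad every vector with NA up to the longest length
n <- max(lengths(cols))
padded <- lapply(cols, function(x) { length(x) <- n; x })

# Replace each NA with the mean of the observed values in that column
filled <- lapply(padded, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})

finaldf <- as.data.frame(filled)

# final = (a*b) + (a*c*d) + (d*e), evaluated row by row
final <- with(finaldf, a * b + a * c * d + d * e)
final
```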

How to extract a list of columns name based on the means of their data?

I'm pretty new to R and hope i'll make myself clear enough.
I have a table of several columns which are factors. I want to make a score for each of these columns. Then I want to calculate the mean of each score, and display the list of columns ranked by their mean scores, is that possible ?
Table would be:
head(musico[,69:73])
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4
I want to make a score for each:
musico$score1<-0
musico$score1[musico$AVIS1==1]<-1
musico$score1[musico$AVIS1==2]<-0.5
then do the mean of each column score: mean of score1, mean of score2, ...:
mean(musico$score1), mean(musico$score2), ...
My goal is to have a list of titles (avis1, avis2,...) ranked by their mean score.
Any advice appreciated !
Here's one way using base although it is somewhat unclear what you want. What does score1 have to do with AVIS1? I think you may be missing some of the data from musico.
Based on the example provided, here's a base R solution. vapply loops through the data.frame and produces the mean for each column. Then the stack and order are only there to make the output a dataframe that looks nice.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE)
means <- vapply(music, mean, 1)
stack(means[order(means, decreasing = TRUE)])
values ind
4 4.000000 AVIS4
3 3.166667 AVIS3
2 2.666667 AVIS2
5 2.500000 AVIS5
1 2.166667 AVIS1
This is how I would do it: first introduce a scores vector to be used as a lookup. I assume that scores decrease by 0.5 and that the number of scores needed matches the maximum number of levels found in your columns (i.e. 6 seen in AVIS1).
Then using tidyr you can organise your data set such that you have two variables (i.e. AVIS and Value) containing the respective levels. Then add a score variable with the mutate function from dplyr in which the position of the score in the score vector matches the value in the Value variable. From here you can find the mean scores corresponding to the AVIS levels, arrange them accordingly and put them in a list.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE) # your data
scores <- seq(1, by = -0.5, length.out = 6) # vector of scores
library(tidyr)
library(dplyr)
music2 <- music %>%
gather(AVIS, Value) %>% # here you tidy the data
mutate(score = scores[Value]) %>% # match score to value
group_by(AVIS) %>% # group AVIS levels
summarise(score.mean = mean(score)) %>% # find mean scores for AVIS levels
arrange(desc(score.mean))
list <- list(AVIS = music2$AVIS) # here is the list
> list$AVIS
[1] "AVIS1" "AVIS5" "AVIS2" "AVIS3" "AVIS4"
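If the scoring rule is exactly the one shown in the question (a value of 1 scores 1, a value of 2 scores 0.5, anything else scores 0), a base R sketch could look like the following. That rule is read off the question's two assignment lines and applied to every AVIS column, which is an assumption; score_lookup and score_col are hypothetical helper names. Converting with as.character means the lookup also works when the columns are factors.

```r
# Example data from the question, typed in column by column
music <- data.frame(AVIS1 = c(2, 2, 3, 1, 1, 4),
                    AVIS2 = c(1, 5, 2, 2, 5, 1),
                    AVIS3 = c(2, 2, 5, 5, 1, 4),
                    AVIS4 = c(3, 3, 5, 5, 3, 5),
                    AVIS5 = c(2, 2, 1, 5, 1, 4))

# Assumed rule: 1 -> 1, 2 -> 0.5, every other value -> 0
score_lookup <- c("1" = 1, "2" = 0.5)

score_col <- function(x) {
  s <- unname(score_lookup[as.character(x)])  # NA for values not in the lookup
  s[is.na(s)] <- 0
  s
}

# Mean score per column, then rank the column names by it
mean_scores <- vapply(music, function(x) mean(score_col(x)), numeric(1))
ranked <- sort(mean_scores, decreasing = TRUE)
names(ranked)
```

With this data AVIS1, AVIS2 and AVIS5 tie on mean score, so their relative order in the ranking carries no meaning.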

How to merge columns in R with different levels of values

I have been given a dataset that I am attempting to perform logistic regression on. However, to do so, I need to merge some columns in R.
For instance in the carevaluations data set, I am given (BuyingPrice_low, BuyingPrice_medium, BuyingPrice_high, BuyingPrice_vhigh, MaintenancePrice_low MaintenancePrice_medium MaintenancePrice_high MaintenancePrice_vhigh)
How would I combine the columns buying price_low, medium, etc. into one column called "BuyingPrice" with the order and their respective data in each column and the same with the maintenanceprice column?
library(dplyr)
library(tidyr) # gather() comes from tidyr
df <- data.frame(Buy_low=rep(c(0,1), 10),
Buy_high=rep(c(0,1), 10))
one_column <- df %>%
gather(var, value)
head(one_column)
var value
1 Buy_low 0
2 Buy_low 1
3 Buy_low 0
4 Buy_low 1
5 Buy_low 0
6 Buy_low 1
It can be done with stack in base R :
df1 <- data.frame(a=1:3,b=4:6,c=7:9)
stack(df1)
# values ind
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
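If the BuyingPrice_* columns are 0/1 indicators with exactly one 1 per row (an assumption, since the question does not show the data), another option is to collapse them into a single ordered factor rather than stacking them. The sketch below uses a made-up four-row data frame called cars; only the column names come from the question.

```r
# Hypothetical indicator data: exactly one 1 per row
cars <- data.frame(BuyingPrice_low    = c(1, 0, 0, 0),
                   BuyingPrice_medium = c(0, 1, 0, 0),
                   BuyingPrice_high   = c(0, 0, 1, 0),
                   BuyingPrice_vhigh  = c(0, 0, 0, 1))

# Find the indicator columns and derive level names from their suffixes
buy_cols <- grep("^BuyingPrice_", names(cars), value = TRUE)
buy_levels <- sub("^BuyingPrice_", "", buy_cols)

# For each row, the position of the column holding the 1
idx <- max.col(as.matrix(cars[buy_cols]), ties.method = "first")

# One ordered factor column: low < medium < high < vhigh
cars$BuyingPrice <- factor(buy_levels[idx], levels = buy_levels, ordered = TRUE)
cars$BuyingPrice
```

The same steps with the prefix "^MaintenancePrice_" would produce the second column.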

How to take the latest entry from a data.frame and store it in new dataframe

I have a data.frame that is full of data, and where the data for parameters repeat itself, but I want to use the latest information that is stored.
Thankfully I have an index in the files that tells me which duplicate is the current row in the data.frame.
Example for my problem is the following:
A B C D
1 1 2 3 1
2 1 2 2 2
3 3 4 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
A small explanation ... A and B columns can be considered key, and the C column represents value for that key ... the column D represents the index of the measurement .. but it does not have to start from 1 ... it can start from 3,6, ... any integer. This is happening because the data is incomplete
So at the end the output should be like:
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
Can you please help me write an R program, or point me in the right direction, that is going to save all the keys with their latest index ...
I have tried using for loops but it didn't work ....
Sincerely thanks
If you have any question feel free to ask
Using duplicated and subsetting in base R, you can do
dat[!duplicated(dat[,1:2], fromLast=TRUE),]
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
duplicated returns a logical vector indicating whether a row (here the first two columns) has been duplicated. The fromLast argument initiates this process from the bottom of the data.frame.
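Note that this relies on the rows of each key already being in ascending order of D. A small sketch that sorts first, so that the last row per key is guaranteed to have the highest index, might look like this (the data frame is typed in from the question; sorted and latest are illustrative names):

```r
# Example data from the question
dat <- data.frame(A = c(1, 1, 3, 3, 2, 2),
                  B = c(2, 2, 4, 4, 3, 1),
                  C = c(3, 2, 2, 1, 2, 1),
                  D = c(1, 2, 2, 3, 1, 1))

# Sort by key (A, B) and then by index D, so the latest entry comes last
sorted <- dat[order(dat$A, dat$B, dat$D), ]

# Keep the last row of each (A, B) key
latest <- sorted[!duplicated(sorted[c("A", "B")], fromLast = TRUE), ]
latest
```

The rows come back in key order rather than in their original order; sorting latest by its row names restores the layout shown in the question.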
You can use dplyr verbs to group your data with group_by, then sort it with arrange. The do verb allows you to operate at the group level, and tail grabs the last row of each group...
library(dplyr)
df1 <- df %>%
group_by(A,B) %>%
arrange(D) %>%
do(tail(.,1)) %>%
ungroup()
Thanks to Frank's suggestion, you could also use slice
df1 <- df %>%
group_by(A,B) %>%
arrange(D) %>%
slice(n()) %>%
ungroup()

Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1:mode df2:sex
1 1
2 2
3
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
1 1 9
1 1 2
2 2 7
3 1 2
1 2 5
and you want to summarise it with dplyr obtaining all combinations (with not existent ones=0):
mode | sex | cases
1 1 11
1 2 5
2 1 0
2 2 7
3 1 2
3 2 0
If you do a single left_join (left_join(df1,df3)) you recover the modes not in df3, but 'Sex' appears as 'NA', and the same if you do left_join(df2,df3).
So how can you do both left join to recover all absent combinations, with cases=0? dplyr preferred, but sqldf an option.
Thanks in advance, p.
The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete that I saw the other day that seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na that's
useful for completing missing combinations of data. It turns
implicitly missing values into explicitly missing values.
To add the missing combinations of df3 and fill with 0 values instead, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
mode sex cases
1 1 1 9
2 1 1 2
3 1 2 5
4 2 1 0
5 2 2 7
6 3 1 2
7 3 2 0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
group_by(mode, sex) %>%
summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
First, here's your data in a more friendly, reproducible format
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a full outer join in dplyr, so I'm going to use base R here to merge df1 and df2 to get all mode/sex combinations. Then I left join that to the data and replace NA values with zero.
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
mode sex cases
1 1 1 11
2 1 2 5
3 2 1 0
4 2 2 7
5 3 1 2
6 3 2 0
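The same result can also be sketched without dplyr at all, summing first and then joining onto the full grid of combinations. This is only an alternative under the assumption that mode takes the values 1 to 3 and sex the values 1 to 2, as in df1 and df2; totals, grid and out are illustrative names.

```r
# Data from the question
df3 <- data.frame(mode = c(1, 1, 2, 3, 1),
                  sex = c(1, 1, 2, 1, 2),
                  cases = c(9, 2, 7, 2, 5))

# Sum cases for the combinations that are present
totals <- aggregate(cases ~ mode + sex, data = df3, FUN = sum)

# All mode/sex combinations, including the absent ones
grid <- expand.grid(mode = c(1, 2, 3), sex = c(1, 2))

# Left join the totals onto the grid, then turn missing totals into 0
out <- merge(grid, totals, all.x = TRUE)
out$cases[is.na(out$cases)] <- 0
out <- out[order(out$mode, out$sex), ]
out
```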
