Order data frame by columns in increasing and decreasing order - r

I'm trying to figure out how to sort a data frame like the one below by c1 in increasing order and c2 in decreasing order.
c1 <- c("a", "b", "c", "d", "d", "e", "f", "g", "h", "i")
c2 <- c("29-JAN-08", "29-JAN-08", "29-JAN-08", "29-JAN-08", "20-MAR-08", "28-MAR-08", "28-MAR-08", "28-MAR-08", "28-MAR-08", "28-MAR-08")
example <- data.frame(c1, c2)
I can't use the - sign with a date vector:
> example <- example[order(example$c1, -example$c2),]
Error: unexpected input in "example <- example[order(example$c1, -1ex"
And I haven't been able to figure out how to use the 'decreasing' argument:
> example <- example[order(example$c1, example$c2, decreasing = c(F, T)),]
Error: unexpected input in "example <- example[order(example$c1, -1ex"
Is there a way I can order this data frame by these two columns, in increasing order by the first one and decreasing order by the second when the columns are character and date types, respectively?

Here's an answer using the data.table package, which shows off its benefits in terms of cleaner code:
library(data.table)
example <- as.data.table(example)
# set the date variable as an actual date first (%y, not %Y, because the years have two digits)
example$c2 <- as.Date(example$c2, format="%d-%b-%y")
# then sort - notice no need to keep referencing example$...
example[order(c1, -as.numeric(c2))]
A base R version of how to do this would use with
example[with(example,order(c1,-as.numeric(c2))),]
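The question also asked about the decreasing argument. As a hedged alternative, and assuming R 3.3.0 or later, order() accepts one decreasing flag per sort key when method = "radix" is used, so no numeric conversion is needed. A minimal sketch, assuming an English (or C) locale so that %b matches "JAN" and "MAR"; the name sorted is only illustrative:

```r
# Rebuild the question's data; stringsAsFactors = FALSE keeps c1 and c2 as character
example <- data.frame(
  c1 = c("a", "b", "c", "d", "d", "e", "f", "g", "h", "i"),
  c2 = c("29-JAN-08", "29-JAN-08", "29-JAN-08", "29-JAN-08", "20-MAR-08",
         "28-MAR-08", "28-MAR-08", "28-MAR-08", "28-MAR-08", "28-MAR-08"),
  stringsAsFactors = FALSE
)

# %y parses the two-digit year; %b is locale dependent (assumption: English month names)
example$c2 <- as.Date(example$c2, format = "%d-%b-%y")

# One flag per key: c1 increasing, c2 decreasing (a vector needs method = "radix")
sorted <- example[order(example$c1, example$c2,
                        decreasing = c(FALSE, TRUE), method = "radix"), ]
sorted
```

Note that the radix method compares character keys in the C locale, which makes no difference for the lowercase letters used here.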

This would do the reverse lexical sort (xtfrm() turns the character values into sortable ranks that can be negated), but it may not be what you intended: since you have not yet converted to Date values, the reverse sorting is done first on the character day "field":
example[ order(example$c1, -xtfrm(example$c2)) , ]
#-------
c1 c2
1 a 29-JAN-08
2 b 29-JAN-08
3 c 29-JAN-08
4 d 29-JAN-08
5 d 20-MAR-08
6 e 28-MAR-08
7 f 28-MAR-08
8 g 28-MAR-08
9 h 28-MAR-08
10 i 28-MAR-08
If you want to do the sort in reverse "true" date-order:
example[ order(example$c1, -as.numeric(as.Date(example$c2, format="%d-%b-%y"))) , ]
#-----
c1 c2
1 a 29-JAN-08
2 b 29-JAN-08
3 c 29-JAN-08
5 d 20-MAR-08
4 d 29-JAN-08
6 e 28-MAR-08
7 f 28-MAR-08
8 g 28-MAR-08
9 h 28-MAR-08
10 i 28-MAR-08

Related

How to compare two variable and different length data frames to add values from one data frame to the other, repeating values where necessary

I apologize as I'm not sure how to word this title exactly.
I have two data frames. df1 is a series of paths with columns "source" and "destination". df2 stores values associated with the destinations. Below is some sample data:
df1
row source destination
1   A      B
2   C      B
3   H      F
4   G      B
df2
row destination n
1   B           26
2   F           44
3   L           12
I would like to compare the two data frames and add the n column to df1 so that df1 has the correct n value for each destination. df1 should look like:
row source destination n
1   A      B           26
2   C      B           26
3   H      F           44
4   G      B           26
The data that I'm actually working with is much larger, and is never the same number of rows when I run the program. The furthest I've gotten with this is using the which command to get the right values, but only each value once.
df2[ which(df2$destination %in% df1$destination), ]$n
[1] 26 44
What I need instead is the vector (26, 26, 44, 26), one value per row of df1, so I can save it to df1$n.
We can use a join (merge or left_join). With data.table, an update join adds n to df1 by reference:
library(data.table)
setDT(df1)[df2, n := i.n, on = .(destination)]
A base R option using match
transform(
  df1,
  n = df2$n[match(destination, df2$destination)]
)
which gives
row source destination n
1 1 A B 26
2 2 C B 26
3 3 H F 44
4 4 G B 26
Data
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"), destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))
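The merge route mentioned above could look roughly like this in base R. It is only a sketch: the name merged is illustrative, and the reordering step is needed because merge() sorts its result by the key column.

```r
df1 <- data.frame(row = 1:4, source = c("A", "C", "H", "G"),
                  destination = c("B", "B", "F", "B"))
df2 <- data.frame(row = 1:3, destination = c("B", "F", "L"), n = c(26, 44, 12))

# Keep only the key and the value column of df2 so its own 'row' column does not clash.
# all.x = TRUE keeps every path in df1, even if a destination were missing from df2.
merged <- merge(df1, df2[, c("destination", "n")], by = "destination", all.x = TRUE)

# merge() sorts by the key, so restore df1's original row order and column layout
merged <- merged[order(merged$row), c("row", "source", "destination", "n")]
merged$n
# [1] 26 26 44 26
```

The match() version above avoids the reordering step, which is one reason to prefer it when only a single column is being looked up.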

Extract subset of data

Ok, I have a matrix of values with certain identifiers, such as:
A 2
B 3
C 4
D 5
E 6
F 7
G 8
I would like to pull out a subset of these values (using R) based on a list of the identifiers ("B", "D", "E") for example, so I would get the following output:
B 3
D 5
E 6
I'm sure there's an easy way to do this (some sort of apply?) but I can't seem to figure it out. Any ideas? Thanks!
If the letters are the row names, then you can just use this:
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))
m[c("B","D","E"),]
# B D E
# 3 5 6
Note that there is a subtle but very important difference between: m[c("B","D","E"),] and m[rownames(m) %in% c("B","D","E"),]. Both return the same rows, but not necessarily in the same order.
The former uses the character vector c("B","D","E") as an index into m. As a result, the rows will be returned in the order of the character vector. For instance:
# result depends on order in c(...)
m[c("B","D","E"),]
# B D E
# 3 5 6
m[c("E","D","B"),]
# E D B
# 6 5 3
The second method, using %in%, creates a logical vector with length = nrow(m). Each element is TRUE if the corresponding row name is present in c("B","D","E"), and FALSE otherwise. Indexing with a logical vector returns rows in the original order:
# result does NOT depend on order in c(...)
m[rownames(m) %in% c("B","D","E"),]
# B D E
# 3 5 6
m[rownames(m) %in% c("E","D","B"),]
# B D E
# 3 5 6
This is probably more than you wanted to know...
Your matrix:
> m <- matrix(2:8, dimnames = list(LETTERS[1:7]))
You can use %in% to select the desired rows. If the original matrix only has a single column, using drop = FALSE will keep the matrix structure. Without it, the result is converted to a named vector.
> m[rownames(m) %in% c("B", "D", "E"), , drop = FALSE]
# [,1]
# B 3
# D 5
# E 6
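One more difference between the two approaches shows up when an identifier is not among the row names: indexing by name stops with an error, while %in% silently skips it. A small sketch, where "Z" is a made-up identifier and wanted, by_name, keep and sub are illustrative names; intersect() is used here to keep the requested order while dropping unknown names:

```r
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))
wanted <- c("E", "Z", "B")  # "Z" is hypothetical and not a row name of m

# Indexing by name fails for the unknown identifier; capture the message instead of stopping
by_name <- tryCatch(m[wanted, , drop = FALSE],
                    error = function(e) conditionMessage(e))

# Keep only the identifiers that exist, in the order they were requested
keep <- intersect(wanted, rownames(m))
sub <- m[keep, , drop = FALSE]
sub
```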

Merging data frames with a non-unique column

I would like to create a new data frame that borrows an ID variable from another data frame. The data frame I would like to merge has repeated observations in the ID column which is causing me some problems.
DF1<-data.frame(ID1=rep(c("A","B", "C", "D", "E") , 2), X1=rnorm(10))
DF2<-data.frame(ID1=c("A", "B", "C", "D", "E"), ID2=c("V","W","X","Y" ,"Z"), X2=rnorm(5), X3=rnorm(5))
What I would like is to append DF2$ID2 onto DF1 by the ID1 column. My goal is something that looks like this (I do not want DF2$X2 and DF2$X3 in the 'Goal' data frame):
Goal<-data.frame(ID2=DF2$ID2, DF1)
I have tried merge but it complains because DF1$ID1 is not unique. I know R can gobble this up in 1 line of code but I can't seem to make the functions I know work. Any help would be greatly appreciated!
There should be no problem with a simple merge. Using your sample data
merge(DF1, DF2[,c("ID1","ID2")], by="ID1")
produces
ID1 X1 ID2
1 A 0.03594331 V
2 A 0.42814900 V
3 B -2.17161263 W
4 B -0.33403550 W
5 C 0.95407844 X
6 C -0.23186723 X
7 D 0.46395514 Y
8 D -1.49919961 Y
9 E -0.20342430 Z
10 E -0.49847569 Z
You could also use left_join from library(dplyr), which keeps the rows in DF1's original order:
library(dplyr)
left_join(DF1, DF2[,c("ID1", "ID2")], by = "ID1")
# ID1 X1 ID2
#1 A -1.20927237 V
#2 B -0.03003128 W
#3 C -0.75799708 X
#4 D 0.53946986 Y
#5 E -0.52009921 Z
#6 A 1.15822659 V
#7 B -0.91976194 W
#8 C 0.74620142 X
#9 D -2.46452560 Y
#10 E 0.80015219 Z
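If only the one column is needed, a lookup with match() is another option that leaves DF1's row order untouched and builds the 'Goal' layout from the question directly. This is a sketch rather than a replacement for the joins above; set.seed() is added only so the random columns are reproducible.

```r
set.seed(1)  # assumption: added only for reproducibility of rnorm()
DF1 <- data.frame(ID1 = rep(c("A", "B", "C", "D", "E"), 2), X1 = rnorm(10))
DF2 <- data.frame(ID1 = c("A", "B", "C", "D", "E"),
                  ID2 = c("V", "W", "X", "Y", "Z"),
                  X2 = rnorm(5), X3 = rnorm(5))

# match() gives, for every row of DF1, the position of its ID1 in DF2$ID1;
# repeated IDs in DF1 simply reuse the same position
Goal <- data.frame(ID2 = DF2$ID2[match(DF1$ID1, DF2$ID1)], DF1)
Goal
```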

Building a function to identify elements in rows of dataframe in R and replace them after a comparison

Hi everybody, I am trying to solve a little problem in R. I have a data frame with a Code variable and five other variables. It looks like this (I add the dput() version at the end):
Code C1 C2 C3 C4 C5
1 abc1 A A A A A
2 bbb1 B Mark C C C
3 cc2 C C Mark D D
4 ccc3 D Mark E Mark E
5 ddd1 A Mark B B B
6 ddd1 Mark Mark B B B
My problem is with the rows; Code is only a reference variable. Any row may contain the string Mark. When Mark is found in a row, I have to make two comparisons: first between Mark and the element just before it (position of Mark minus 1), and second between Mark and the element just after it (position of Mark plus 1). If Mark differs from both neighbours, I have to replace Mark with the element at position Mark plus 1.
For example, row two contains B, Mark and C. The function I tried to write should first identify that Mark is in the row, then compare Mark with the element at position Mark-1 (here B), then compare Mark with the element at position Mark+1 (here C). Mark differs from B and from C, so both comparisons are satisfied and Mark is replaced by the element at position Mark+1, which is C.
I wrote a function but I don't know what is wrong with it. My data frame is test. The function is:
test[-1] <- t(apply(
  test[-1],
  1,
  function(x) {
    if(x=="Mark" & x!=x[which(x)-1] & x!=x[which(x)+1]) {
      x=x[which(x)+1]
    } else
      x
  }
))
When I apply this over test I got this error:
Error in which(x) : argument to 'which' is not logical
I tried to fix the logic inside the function but it doesn't work. I know which() detects positions, but here it gives me an error. I would like to get something like this:
Code C1 C2 C3 C4 C5
1 abc1 A A A A A
2 bbb1 B C C C C
3 cc2 C C D D D
4 ccc3 D E E E E
5 ddd1 A B B B B
6 ddd1 Mark Mark B B B
I would like to identify what is wrong in the function. The dput version of test is as follows:
structure(list(Code = c("abc1", "bbb1", "cc2", "ccc3", "ddd1",
"ddd1"), C1 = c("A", "B", "C", "D", "A", "Mark"), C2 = c("A",
"Mark", "C", "Mark", "Mark", "Mark"), C3 = c("A", "C", "Mark",
"E", "B", "B"), C4 = c("A", "C", "D", "Mark", "B", "B"), C5 = c("A",
"C", "D", "E", "B", "B")), .Names = c("Code", "C1", "C2", "C3",
"C4", "C5"), row.names = c(NA, 6L), class = "data.frame")
Many thanks for your help.
I'm not certain, but it seems like this might do the trick for you. I usually steer clear of which because it makes things a bit more confusing than they have to be. We can solve this problem simply by using the position numbers of the data, after converting it to matrix form and from factor to character variables.
## get the data
> dat <- read.table(header = TRUE, text = "Code C1 C2 C3 C4 C5
1 abc1 A A A A A
2 bbb1 B Mark C C C
3 cc2 C C Mark D D
4 ccc3 D Mark E Mark E
5 ddd1 A Mark B B B
6 ddd1 Mark Mark B B B", row.names = 1)
## manipulate using the position numbers
> dat <- sapply(dat, as.character)
> nr <- nrow(dat)
> gg <- grep("Mark", dat)
> dat[gg] <- sapply(seq(dat[gg]), function(i){
ifelse(dat[gg+nr][i] > dat[gg-nr][i], dat[gg+nr][i], dat[gg-nr][i])
})
> as.data.frame(dat)
## Code C1 C2 C3 C4 C5
## 1 abc1 A A A A A
## 2 bbb1 B C C C C
## 3 cc2 C C D D D
## 4 ccc3 D E E E E
## 5 ddd1 A B B B B
## 6 ddd1 Mark Mark B B B
It's not clear what the column previous to C1 is, or after C5, so it's ambiguous how we should deal with Marks in those columns, but assuming you're happy in those cases if the one neighbour is the same, then:
cols <- sprintf("C%d",1:5)
colsLeft <- c(cols[-1], cols[length(cols)])
colsRight <- c(cols[1], cols[-length(cols)])
comp <- df[cols]=="Mark" & df[colsRight]!="Mark" & df[colsLeft]!="Mark"
df[cols][comp] <- df[colsLeft][comp]
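The idea is to set up two 'shifts' of the data frame (df here stands for the question's test), one to the right and one to the left: colsLeft pairs each column with its right-hand neighbour, and colsRight pairs it with its left-hand neighbour. comp is then TRUE where the original cell is Mark but neither of the shifted versions is. Finally, those cells of df are set to the corresponding cells of the left-shifted version.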
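As for what is wrong in the question's function: which() expects a logical vector, so it has to be which(x == "Mark") rather than which(x), and if() needs a single TRUE or FALSE rather than one value per element of the row. A row-wise sketch that follows the rule as described in the question might look like this. The helper name replace_mark is hypothetical, and it assumes that a Mark in the first or last column is left alone because one of its neighbours is missing.

```r
test <- data.frame(
  Code = c("abc1", "bbb1", "cc2", "ccc3", "ddd1", "ddd1"),
  C1 = c("A", "B", "C", "D", "A", "Mark"),
  C2 = c("A", "Mark", "C", "Mark", "Mark", "Mark"),
  C3 = c("A", "C", "Mark", "E", "B", "B"),
  C4 = c("A", "C", "D", "Mark", "B", "B"),
  C5 = c("A", "C", "D", "E", "B", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: x is one row as a character vector
replace_mark <- function(x) {
  out <- x
  for (i in which(x == "Mark")) {          # positions of Mark in this row
    has_both <- i > 1 && i < length(x)      # assumption: edge positions stay unchanged
    if (has_both && x[i - 1] != "Mark" && x[i + 1] != "Mark") {
      out[i] <- x[i + 1]                    # take the element after Mark
    }
  }
  out
}

# apply() returns one column per row, hence the t()
test[-1] <- t(apply(test[-1], 1, replace_mark))
test
```

The comparisons always look at the original row x, not at out, so two Marks in the same row do not influence each other.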

R - How to apply different functions to certain rows in a column

I am trying to apply different functions to different rows based on the value of a string in an adjacent column. My dataframe looks like this:
type size
A 1
B 3
A 4
C 2
C 5
A 4
B 32
C 3
and I want to apply different functions to types A, B, and C, to give a third column, "size2". For example, let's say the following functions apply to A, B, and C:
for A: size2 = 3*size
for B: size2 = size
for C: size2 = 2*size
I'm able to do this for each type separately using this code
df$size2 <- ifelse(df$type == "A", 3*df$size, NA)
df$size2 <- ifelse(df$type == "B", 1*df$size, NA)
df$size2 <- ifelse(df$type == "C", 2*df$size, NA)
However, I can't seem to do it for all of the types without erasing all of the other values. I tried to use this code to limit the application of the function to only those values that were NA (i.e., keep existing values and only fill in NA values), but it didn't work using this code:
df$size2 <- ifelse(is.na(df$size2), ifelse(df$type == "C", 2*df$size, NA), NA)
Does anyone have any ideas? Is it possible to use some kind of AND statement with "is.na(df$size2)" and "ifelse(df$type == "C""?
Many thanks!
This might be a mite more R-ish (and I called my dataframe 'dat' instead of 'df' since df is a commonly used function).
> facs <- c(3,1,2)
> dat$size2= dat$size* facs[ match( dat$type, c("A","B","C") ) ]
> dat
type size size2
1 A 1 3
2 B 3 3
3 A 4 12
4 C 2 4
5 C 5 10
6 A 4 12
7 B 32 32
8 C 3 6
The match function is used to construct indexes to supply to the extract function [.
If you want, you can nest the ifelses:
df$size2 <- ifelse(df$type == "A", 3*df$size,
ifelse(df$type == "B", 1*df$size,
ifelse(df$type == "C", 2*df$size, NA)))
# > df
# type size size2
#1 A 1 3
#2 B 3 3
#3 A 4 12
#4 C 2 4
#5 C 5 10
#6 A 4 12
#7 B 32 32
#8 C 3 6
You could do it like this, creating separate logical vectors for each type:
As <- df$type == 'A'
Bs <- df$type == 'B'
Cs <- df$type == 'C'
df$size2[As] <- 3*df$size[As]
df$size2[Bs] <- df$size[Bs]
df$size2[Cs] <- 2*df$size[Cs]
but a more direct approach would be to create a separate lookup table like this:
df$size2 <- c(A=3,B=1,C=2)[as.character(df$type)] * df$size
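The lookup table of multipliers works because every rule here is a simple scaling. If the per-type rules were arbitrary functions, the same idea could be sketched with a named list of functions; the list name funs and the loop are illustrative, not taken from the answers above.

```r
df <- data.frame(type = c("A", "B", "A", "C", "C", "A", "B", "C"),
                 size = c(1, 3, 4, 2, 5, 4, 32, 3),
                 stringsAsFactors = FALSE)

# One function per type; any vectorised function of size would do
funs <- list(
  A = function(s) 3 * s,
  B = function(s) s,
  C = function(s) 2 * s
)

df$size2 <- NA_real_                 # types without a rule stay NA
for (tp in names(funs)) {
  idx <- df$type == tp               # rows of this type
  df$size2[idx] <- funs[[tp]](df$size[idx])
}
df$size2
# [1]  3  3 12  4 10 12 32  6
```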
