How to sort a dataframe in decreasing order with lapply and sort in R

Not sure if this is a duplicate, but I couldn't find anything that solves either my original problem or the issue I'm running into with the partial solution I did find.
The goal is to sort a dataframe independently by column.
Reproducible example
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a
name date1 date2 date3
1 a 2 0 0
2 a 3 2 2
3 a 1 3 0
4 b 3 1 3
5 b 1 2 2
6 b 2 0 1
library(plyr)
b <- ddply(a, "name", function(x) { as.data.frame(lapply(x, sort)) })
b
name date1 date2 date3
1 a 1 0 0
2 a 2 2 0
3 a 3 3 2
4 b 1 0 1
5 b 2 1 2
6 b 3 2 3
Now this works as expected, but is the opposite of what I'm looking to do.
Desired output
b
name date1 date2 date3
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
I've tried to add in the decreasing = TRUE parameter, but haven't had any luck with the variations I've tried; I usually end up with an error about missing arguments or undefined columns being selected. How does one correctly implement a decreasing sort with this syntax, and/or otherwise achieve the end result without relying on explicitly naming the columns (the names are dates, so they change often)?
Bonus
How could this code be adapted to account for NAs with na.last?
Thank you!

I think you nuked the data.frame rows with your code by sorting each column independently, which is not very good practice. Standard dplyr usage sorts whole rows with the arrange() function, like this:
library(tidyverse)
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a %>%
  arrange(name, -date1)
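For non-numeric columns, desc() is the more general way to reverse the sort order:

a %>%
  arrange(name, desc(date1))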
If you want to live dangerously, here is the code for it:
a %>%
  group_by(name) %>%
  mutate_all(sort, decreasing = TRUE)
name date1 date2 date3
<fct> <dbl> <dbl> <dbl>
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
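In current dplyr releases mutate_all() is superseded; a roughly equivalent sketch with across() (the same caveat about tearing rows apart applies):

library(dplyr)

a %>%
  group_by(name) %>%
  mutate(across(everything(), function(x) sort(x, decreasing = TRUE))) %>%
  ungroup()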

A solution with the data.table package is the following
library(data.table)
a <- data.table(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# alternatively:
# a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# setDT(a)
b <- a[, lapply(.SD, sort, decreasing = TRUE), by = name]
.SD is the Subset of Data for each group, in this case created by by = name: it splits the original data.table by the values in the given column.
This also fulfills the bonus requirement: na.last can be supplied to sort().
aa <- data.table(name = c("a","a","a","b","b","b"),date1 = c(NA,3,1,3,1,NA),date2 = c(0,2,NA,1,2,0),date3 = c(0,2,0,3,2,NA))
bb <- aa[, lapply(.SD, sort, decreasing = TRUE, na.last = TRUE), by = name]
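For completeness, the plyr/ddply approach from the question also works once decreasing = TRUE (and, for the bonus, na.last = TRUE) is passed through to sort(); a minimal sketch:

library(plyr)

b <- ddply(a, "name", function(x) {
  as.data.frame(lapply(x, sort, decreasing = TRUE, na.last = TRUE))
})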

Related

Replacing values from another data frame based on the information in the first column in R

I'm trying to merge information in two different data frames, but the problem begins with uneven dimensions and needing to match on the information in a column rather than the column index. R's merge function and dplyr's joins don't work with my data.
I have two dataframes (one is a subset of the other, with updated info in the last column):
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"))
Name val Case
1 A 1 NA
2 B 2 1
3 C 3 NA
4 D 1 NA
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 NA
9 I 3 NA
Some rows in the Case column in df1 have to be changed with the info in the df2 below:
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1")
Name val Case
1 A 1 1
2 D 2 1
3 H 3 1
There's nothing important in the val column; I added it to the examples to indicate that I have more than two columns, and that my real data is much bigger than these examples.
Basically, I want to change specific rows by checking the information in the first columns (in this case, they're unique letters) and in the end I still want to have df1 as a final data frame.
For a better explanation, I want to see something like this:
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
Note changed information for A,D and H.
Thanks.
%in% from base R comes to the rescue.
df1=data.frame(Name = print(LETTERS[1:9]), val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"), stringsAsFactors = F)
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1", stringsAsFactors = F)
df1$Case <- ifelse(df1$Name %in% df2$Name, df2$Case[df2$Name %in% df1$Name], df1$Case)
df1
Output:
> df1
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
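A caveat: ifelse() recycles the replacement vector, which lines up here only because every Case in df2 is "1". A more robust base-R variant matches on Name explicitly; a sketch:

idx <- match(df1$Name, df2$Name)  # position of each df1 Name in df2, NA if absent
df1$Case <- ifelse(is.na(idx), df1$Case, df2$Case[idx])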
Here is what I would do using dplyr:
library(dplyr)

df1 %>%
  left_join(df2, by = "Name") %>%
  mutate(val = if_else(is.na(val.y), val.x, val.y),
         Case = if_else(is.na(Case.y), Case.x, Case.y)) %>%
  select(Name, val, Case)
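dplyr also provides coalesce(), which expresses the same "prefer the updated value, fall back to the original" logic more compactly; a sketch of the same join:

df1 %>%
  left_join(df2, by = "Name") %>%
  mutate(val = coalesce(val.y, val.x),
         Case = coalesce(Case.y, Case.x)) %>%
  select(Name, val, Case)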

Finding maximum value across multiple columns (pmax gives NAs)

Using reshape function, I have data that looks something like this:
df1
ID Score1 Score2 Score3
1 2 3 1
2 3 2 1
3 2 1 NA
4 1 NA NA
As you can see, some of my score variables have missing values.
I'm interested in finding the maximum score variable for all ID values. When I tried using pmax(df1$Score1,df1$Score2,df1$Score3), my resulting vector contains NAs. I'm not sure why this is, as I know that my Score1 variable doesn't contain any NAs.
This is what I'd like my output to accomplish:
ID MaxScore
1 3
2 3
3 2
4 1
Thanks
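For reference, a reproducible version of df1 matching the table above (assuming the NA entries are genuine NAs):

df1 <- data.frame(ID     = 1:4,
                  Score1 = c(2, 3, 2, 1),
                  Score2 = c(3, 2, 1, NA),
                  Score3 = c(1, 1, NA, NA))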
You could use apply on each row (MARGIN = 1)
apply(X = df1[,-1], MARGIN = 1, FUN = max, na.rm = TRUE)
#[1] 3 3 2 1
pmax() returns NA wherever any of its inputs is NA unless na.rm = TRUE is supplied. We can use the vectorized pmax via do.call, which expands the call to pmax(Score1, Score2, Score3, na.rm = TRUE):
cbind(df1['ID'], MaxScore = do.call(pmax, c(df1[-1], na.rm = TRUE)))
# ID MaxScore
#1 1 3
#2 2 3
#3 3 2
#4 4 1

Summary of values across rows and columns in R

I have a dataset that looks like:
Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4
I want the summary of this dataset across rows and columns for the count of each value as below:
Group 4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
The count of 4 in the dataset for group XYZ across all rows and columns is 1; for 2 and 1 it is 2; for 3 it is 1. I can do this by creating four new columns (4, 3, 2, 1) and getting the counts row-wise and then column-wise, but this is neither efficient nor scalable. I am sure there is a better way to get this done.
Using the reshape2 package, we can melt and dcast as follows:
library(reshape2)
dcast(na.omit(melt(df, id.vars = 'Group')), Group ~ value, fun.aggregate = length)
# Group 1 2 3 4
#1 DEF 3 1 3 1
#2 PQR 2 2 1 1
#3 XYZ 2 2 1 1
This uses no packages and is just one line. Here DF$Group[row(DF[-1])] is a vector of Group labels in which each element corresponds to an element of the unravelled numeric vector unlist(DF[-1]).
table(DF$Group[row(DF[-1])], unlist(DF[-1]))
giving:
1 2 3 4
DEF 3 1 3 1
PQR 2 2 1 1
XYZ 2 2 1 1
If the order of rows and columns shown in the question is important, then we can create factors from each of the two table arguments, with the factor levels defined in the desired orders. In that case we use the following line instead of the line of code above:
table(Group = factor(DF$Group[row(DF[-1])], unique(DF$Group)), factor(unlist(DF[-1]), 4:1))
giving:
Group 4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
The above produces an object of class "table". This is a particularly suitable class for tabulated frequencies. For example, once in this form ftable can be used to easily rearrange it further as in ftable(tab, row.vars = 2) or ftable(tab, row.vars = 1:2) where tab is the above computed table.
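For example, a quick sketch (tab is just the table computed above, assigned to a name):

tab <- table(Group = factor(DF$Group[row(DF[-1])], unique(DF$Group)),
             factor(unlist(DF[-1]), 4:1))
ftable(tab, row.vars = 2)  # swap dimensions so the values 4:1 become the rows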
If a data.frame were preferred then convert it like this:
cbind(Group = rownames(tab), as.data.frame.matrix(tab))
The input data.frame DF is defined reproducibly in Note 2 at the end.
Alternatives
Although the above seems the most direct here are some other alternatives that also use no packages:
1) by For each set of rows having the same Group value, the anonymous function creates a data.frame identifying the Group, converts the columns other than the first to a factor with the indicated levels, and runs table to get the counts. The "by" list that is returned is sorted back to the original order, and we rbind everything back together.
do.call("rbind",
by(DF, DF$Group, function(x) {
data.frame(Group = x[1,1],
as.list(table(factor(unlist(x[, -1]), levels = 4:1))),
check.names = FALSE)
})[unique(DF$Group)])
giving:
Group 4 3 2 1
XYZ XYZ 1 1 2 2
DEF DEF 1 3 1 3
PQR PQR 1 1 2 2
1a) This slightly shorter variation would also work. It returns a matrix identifying the groups using row names.
kount <- function(x) table(factor(unlist(x), levels = 4:1))
m <- do.call("rbind", by(DF[, -1], DF$Group, kount)[unique(DF$Group)])
giving:
> m
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
2) outer
gps <- unique(DF$Group)
levs <- 4:1
kount2 <- function(g, lv) sum(subset(DF, Group == g)[-1] == lv, na.rm = TRUE)
m <- outer(gps, levs, Vectorize(kount2))
dimnames(m) <- list(gps, levs)
giving this matrix:
> m
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
3) sapply
kount3 <- function(g) table(factor(unlist(DF[DF$Group == g, -1]), levels = 4:1))
gps <- as.character(unique(DF$Group))
do.call("rbind", sapply(gps, kount3, simplify = FALSE))
giving:
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
4) aggregate
aggregate(1:nrow(DF), DF["Group"], function(ix)
  table(factor(unlist(DF[ix, -1]), levels = 4:1)))[unique(DF$Group), ]
giving:
Group x.4 x.3 x.2 x.1
3 XYZ 1 1 2 2
1 DEF 1 3 1 3
2 PQR 1 1 2 2
5) tapply
do.call("rbind", tapply(1:nrow(DF), DF$Group, function(ix)
table(factor(unlist(DF[ix, -1]), levels = 4:1))))[unique(DF$Group), ]
6) reshape
with(reshape(DF, dir = "long", varying = list(2:5)),
     table(factor(Group, unique(DF$Group)), factor(A, 4:1)))
giving:
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
Note 1: (1a), (2), (3), (5) and (6) produce a matrix or table result with groups as row names. If you prefer a data frame with Groups as a column then supposing that m is the matrix, add this:
data.frame(Group = rownames(m), m, check.names = FALSE)
Note 2: The input DF in reproducible form is:
Lines <- "Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4"
DF <- read.table(text = Lines, header = TRUE, na.strings = "Na")
We can use dplyr/tidyr
library(dplyr)
library(tidyr)
df1 %>%
  mutate_each(funs(replace(., . == "Na", NA))) %>%
  gather(Var, Val, A:D, na.rm = TRUE) %>%
  group_by(Group, Val) %>%
  tally() %>%
  spread(Val, n)
# Group `1` `2` `3` `4`
#* <chr> <int> <int> <int> <int>
#1 DEF 3 1 3 1
#2 PQR 2 2 1 1
#3 XYZ 2 2 1 1
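mutate_each(), gather(), and spread() are superseded in current dplyr/tidyr; a rough sketch of the same pipeline with the newer verbs (assuming, as above, that the missing entries are the literal string "Na"):

library(dplyr)
library(tidyr)

df1 %>%
  mutate(across(A:D, ~ na_if(as.character(.x), "Na"))) %>%
  pivot_longer(A:D, names_to = "Var", values_to = "Val", values_drop_na = TRUE) %>%
  count(Group, Val) %>%
  pivot_wider(names_from = Val, values_from = n)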

Subsetting a Data Table using %in%

A stylized version of my data.table is
outmat <- data.table(merge(merge(1:5, 1:5, all=TRUE), 1:5, all=TRUE))
What I would like to do is select a subset of rows from this data.table based on whether the value in the 1st column is found in any of the other columns (it will be handling matrices of unknown dimension, so I can't just use something like "row1 == row2 | row1 == row3").
I wanted to do this using
output[row1 %in% names(output)[-1], ]
but this ends up returning TRUE if the value in row1 is found in any of the rows of row2 or row3, which is not the intended behavior. Is there some sort of vectorized version of %in% that will achieve my desired result?
To elaborate, what I want to get is the enumeration of 3-tuples from the set 1:5, drawn with replacement, such that the first value is the same as either the second or third value, something like:
1 1 1
1 1 2
1 1 3
1 1 4
1 1 5
...
2 1 2
2 2 1
...
5 5 5
What my code instead gives me is every enumeration of 3-tuples, as it checks whether the first digit (say, 5) ever appears anywhere in the 2nd or 3rd columns, not simply within the same row.
One option is to construct the expression and evaluate it:
dt = data.table(a = 1:5, b = c(1,2,4,3,1), c = c(4,2,3,2,2), d = 5:1)
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
#4: 4 3 2 2
#5: 5 1 2 1
expr = paste(paste(names(dt)[-1], collapse = paste0(" == ", names(dt)[1], " | ")),
             "==", names(dt)[1])
#[1] "b == a | c == a | d == a"
dt[eval(parse(text = expr))]
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
Another option is to just loop through and compare the columns (the first column always matches itself, hence the > 1 threshold):
dt[rowSums(sapply(dt, '==', dt[[1]])) > 1]
# a b c d
#1: 1 1 4 5
#2: 2 2 2 4
#3: 3 4 3 3
library(dplyr)
library(tidyr)

dt %>%
  mutate(ID = 1:n()) %>%                # row identifier, since dt has no key
  gather(variable, value, -a, -ID) %>%  # long form; 'a' is the first column here
  filter(a == value) %>%                # keep rows where the first column matches
  select(ID) %>%
  distinct() %>%
  left_join(dt %>% mutate(ID = 1:n()), by = "ID") %>%
  select(-ID)

Create counter with multiple variables [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 6 years ago.
I have my data that looks like below:
CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013
I need to create a counter variable, which will be like below:
CustomerID TripDate TripCounter
1 1/3/2013 1
1 1/4/2013 2
1 1/9/2013 3
2 2/1/2013 1
2 2/4/2013 2
3 1/2/2013 1
Tripcounter will be for each customer.
Use ave. Assuming your data.frame is called "mydf":
mydf$counter <- with(mydf, ave(CustomerID, CustomerID, FUN = seq_along))
mydf
# CustomerID TripDate counter
# 1 1 1/3/2013 1
# 2 1 1/4/2013 2
# 3 1 1/9/2013 3
# 4 2 2/1/2013 1
# 5 2 2/4/2013 2
# 6 3 1/2/2013 1
For what it's worth, I also implemented a version of this approach in a function included in my "splitstackshape" package. The function is called getanID:
mydf <- data.frame(IDA = c("a", "a", "a", "b", "b", "b", "b"),
IDB = c(1, 2, 1, 1, 2, 2, 2), values = 1:7)
mydf
# install.packages("splitstackshape")
library(splitstackshape)
# getanID(mydf, id.vars = c("IDA", "IDB"))
getanID(mydf, id.vars = 1:2)
# IDA IDB values .id
# 1 a 1 1 1
# 2 a 2 2 1
# 3 a 1 3 2
# 4 b 1 4 1
# 5 b 2 5 1
# 6 b 2 6 2
# 7 b 2 7 3
As you can see from the example above, I've written the function in such a way that you can specify one or more columns that should be treated as ID columns. It checks to see if any of the id.vars are duplicated, and if they are, then it generates a new ID variable for you.
You can also use plyr for this (using @AnandaMahto's example data):
> ddply(mydf, .(IDA), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 2 2 2
3 a 1 3 3
4 b 1 4 1
5 b 2 5 2
6 b 2 6 3
7 b 2 7 4
or even:
> ddply(mydf, .(IDA, IDB), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 1 3 2
3 a 2 2 1
4 b 1 4 1
5 b 2 5 1
6 b 2 6 2
7 b 2 7 3
Note that plyr does not have a reputation for being the quickest solution; for that, you need to take a look at data.table.
Here's a data.table approach:
library(data.table)
DT <- data.table(mydf)
DT[, .id := sequence(.N), by = "IDA,IDB"]
DT
# IDA IDB values .id
# 1: a 1 1 1
# 2: a 2 2 1
# 3: a 1 3 2
# 4: b 1 4 1
# 5: b 2 5 1
# 6: b 2 6 2
# 7: b 2 7 3
Meanwhile, you can also use dplyr. If your data.frame is called mydata:
library(dplyr)
mydata %>% group_by(CustomerID) %>% mutate(TripCounter = row_number())
I need to do this often, and wrote a function that accomplishes it differently than the previous answers. I am not sure which solution is most efficient. Note that rle() counts consecutive runs, so this assumes each CustomerID's rows are contiguous (sort first if they are not):
idCounter <- function(x) {
unlist(lapply(rle(x)$lengths, seq_len))
}
mydf$TripCounter <- idCounter(mydf$CustomerID)
Here's the procedural-style code. I don't believe in claims like "if you are using a loop in R then you are probably doing something wrong":
x <- dataframe$CustomerID
y <- integer(length(x))
count <- 1
for (i in seq_along(x)) {
  if (i > 1 && x[i] == x[i - 1]) count <- count + 1 else count <- 1
  y[i] <- count
}
dataframe$counter <- y
This isn't the right answer, but it shows something interesting compared with for loops: vectorized code is fast, but it does not care about sequential updating.
a<-read.table(textConnection(
"CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013 "), header=TRUE)
library(dplyr)
a <- a %>%
  group_by(CustomerID, TripDate)  # must be in order
res <- rep(1, nrow(a))  # base case: 1
res[2:6] <- sapply(2:6, function(i)
  if (a$CustomerID[i] == a$CustomerID[i - 1]) res[i - 1] + 1 else res[i])
# res inside sapply still sees the original all-ones vector, so counts never
# exceed 2: exactly the sequential-updating pitfall being demonstrated
a$TripCounter <- res
