R: mixedsort on multiple vectors (columns)

This is a follow-up to this question, which was marked as a duplicate of this one, but the suggested solution does not work.
I have the following data.frame:
set.seed(1)
mydf <- data.frame(A=paste(sample(LETTERS, 4), sample(1:20, 20), sep=""),
B=paste(sample(1:20, 20), sample(LETTERS, 4), sep=""),
C=sample(LETTERS, 20), D=sample(1:100, 20), value=rnorm(20))
> mydf
A B C D value
1 G5 6N T 9 -0.68875569
2 J18 8T R 87 -0.70749516
3 N19 1A L 34 0.36458196
4 U12 7K Z 82 0.76853292
5 G11 14N J 98 -0.11234621
6 J1 20T F 32 0.88110773
7 N3 17A B 45 0.39810588
8 U14 19K W 83 -0.61202639
9 G9 15N U 80 0.34111969
10 J20 3T I 36 -1.12936310
11 N8 9A K 70 1.43302370
12 U16 16K G 86 1.98039990
13 G6 10N M 39 -0.36722148
14 J7 18T D 62 -1.04413463
15 N13 5A Y 35 0.56971963
16 U4 11K N 28 -0.13505460
17 G17 4N O 64 2.40161776
18 J15 2T C 17 -0.03924000
19 N2 12A P 59 0.68973936
20 U10 13K X 10 0.02800216
I want to order it according to columns A to D, but A and B mix letters and numbers, so natural ordering is required.
I know I can apply regular ordering, like:
mydf2 <- mydf[do.call(order, c(mydf[1:4], list(decreasing = FALSE))),]
> mydf2
A B C D value
5 G11 14N J 98 -0.11234621
17 G17 4N O 64 2.40161776
1 G5 6N T 9 -0.68875569
13 G6 10N M 39 -0.36722148
9 G9 15N U 80 0.34111969
6 J1 20T F 32 0.88110773
18 J15 2T C 17 -0.03924000
2 J18 8T R 87 -0.70749516
10 J20 3T I 36 -1.12936310
14 J7 18T D 62 -1.04413463
15 N13 5A Y 35 0.56971963
3 N19 1A L 34 0.36458196
19 N2 12A P 59 0.68973936
7 N3 17A B 45 0.39810588
11 N8 9A K 70 1.43302370
20 U10 13K X 10 0.02800216
4 U12 7K Z 82 0.76853292
8 U14 19K W 83 -0.61202639
12 U16 16K G 86 1.98039990
16 U4 11K N 28 -0.13505460
But this is not the result I need. I need 10 after 9, not after 1 (you can check column A to see it is not in the order I need.)
In the comments of my original question, it was suggested to use the multi.mixedorder function.
However, as you can see below, the result is identical to the one using just order, which is still not what I want.
multi.mixedorder <- function(..., na.last = TRUE, decreasing = FALSE){
do.call(order, c(
lapply(list(...), function(l){
if(is.character(l)){
factor(l, levels=mixedsort(unique(l)))
} else {
l
}
}),
list(na.last = na.last, decreasing = decreasing)
))
}
mydf3 <- mydf[do.call(multi.mixedorder, c(mydf[1:4], list(decreasing = FALSE))),]
> mydf3
A B C D value
5 G11 14N J 98 -0.11234621
17 G17 4N O 64 2.40161776
1 G5 6N T 9 -0.68875569
13 G6 10N M 39 -0.36722148
9 G9 15N U 80 0.34111969
6 J1 20T F 32 0.88110773
18 J15 2T C 17 -0.03924000
2 J18 8T R 87 -0.70749516
10 J20 3T I 36 -1.12936310
14 J7 18T D 62 -1.04413463
15 N13 5A Y 35 0.56971963
3 N19 1A L 34 0.36458196
19 N2 12A P 59 0.68973936
7 N3 17A B 45 0.39810588
11 N8 9A K 70 1.43302370
20 U10 13K X 10 0.02800216
4 U12 7K Z 82 0.76853292
8 U14 19K W 83 -0.61202639
12 U16 16K G 86 1.98039990
16 U4 11K N 28 -0.13505460

OK, solved it: the multi.mixedorder function needs a fix to be able to deal with factors:
library(gtools)  # provides mixedsort()
multi.mixedorder <- function(..., na.last = TRUE, decreasing = FALSE){
  do.call(order, c(
    lapply(list(...), function(l){
      if(is.character(l)){
        factor(l, levels=mixedsort(unique(l)))
      } else if(is.factor(l)){
        # re-level factors in natural order instead of alphabetical order
        factor(as.character(l), levels=mixedsort(levels(l)))
      } else {
        l  # numeric columns are passed to order() unchanged
      }
    }),
    list(na.last = na.last, decreasing = decreasing)
  ))
}
Alternatively, convert the factor columns in mydf to character first:
mydf[] <- lapply(mydf, function(col) if (is.factor(col)) as.character(col) else col)
With the fix above, this should not be needed.
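To see what a natural-order key does differently from plain order(), here is a minimal base-R sketch that does not use gtools. It swaps in a simpler technique: each value is split into a text part and a numeric part, and both are handed to order(). The helper split_mixed and the toy data are hypothetical, and the sketch assumes every value contains exactly one run of digits, which holds for columns A and B above but not for arbitrary strings.

```r
# Toy data shaped like columns A and B of mydf (hypothetical values).
toy <- data.frame(A = c("G11", "G5", "G5", "J1"),
                  B = c("6N", "14N", "2N", "20T"),
                  stringsAsFactors = TRUE)  # factors, as in the question

# Hypothetical helper: separate the letters from the digits.
# Assumes exactly one run of digits per value.
split_mixed <- function(x) {
  x <- as.character(x)  # works for factor and character input alike
  list(text = gsub("[0-9]+", "", x),
       num  = as.integer(gsub("[^0-9]+", "", x)))
}

a <- split_mixed(toy$A)
b <- split_mixed(toy$B)

# Column A is ordered by letter, then number; column B breaks ties
# by number, then letter.
natural <- toy[order(a$text, a$num, b$num, b$text), ]
natural
```

For real data with irregular strings, gtools::mixedsort as used in the fix above is the more robust choice; this sketch only illustrates the idea of sorting on separate text and numeric keys.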

Related

Put the first row as the column names of my dataframe with dplyr in R

This is my dataframe:
x<-data.frame(A = c(letters[1:10]), M1 = c(11:20), M2 = c(31:40), M3 = c(41:50))
colnames(x)<-NULL
I want to transpose it (t(x)) and use the first column of x as the column names of the new data frame t(x).
The column names of t(x) also need to be stored as character strings.
Is it possible to do this with the dplyr package?
The {janitor} package is good for this and is flexible enough to push any row into the column names. Transpose first, so that the old first column becomes the first row:
library(tidyverse)
library(janitor)
x <- x %>% t() %>% as.data.frame() %>% row_to_names(row_number = 1)
You can do this easily in base R. Just make the first column of x be the row names, then remove the first column and transpose.
row.names(x) = x[,1]
x = t(x[,-1])
x
a b c d e f g h i j
M1 11 12 13 14 15 16 17 18 19 20
M2 31 32 33 34 35 36 37 38 39 40
M3 41 42 43 44 45 46 47 48 49 50
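On the question of types, a minimal base-R sketch (the variable names all_cells, num and res are only illustrative) shows why the first column is dropped before transposing: t() on the whole data frame has to coerce every cell to character, whereas transposing only the numeric columns keeps the numbers numeric and leaves the letters for the column names.

```r
x <- data.frame(A = letters[1:10], M1 = 11:20, M2 = 31:40, M3 = 41:50)

# Transposing the whole data frame mixes letters and numbers,
# so every cell is coerced to character.
all_cells <- t(x)
is.character(all_cells)

# Transposing only the numeric columns keeps the numbers numeric.
num <- t(x[, -1])
colnames(num) <- as.character(x$A)  # column names are character strings
res <- as.data.frame(num)
str(res)
```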
Try this:
library(dplyr)
library(tidyr)
x <- data.frame(
A = c(letters[1:10]),
M1 = c(11:20),
M2 = c(31:40),
M3 = c(41:50))
x %>%
gather(key = key, value = value, 2:ncol(x)) %>%
spread(key = names(x)[1], value = "value")
key a b c d e f g h i j
1 M1 11 12 13 14 15 16 17 18 19 20
2 M2 31 32 33 34 35 36 37 38 39 40
3 M3 41 42 43 44 45 46 47 48 49 50
I think column_to_rownames from the tibble package would be your simplest solution. Use it before you transpose with t.
library(magrittr)
library(tibble)
x %>%
column_to_rownames("A") %>%
t
#> a b c d e f g h i j
#> M1 11 12 13 14 15 16 17 18 19 20
#> M2 31 32 33 34 35 36 37 38 39 40
#> M3 41 42 43 44 45 46 47 48 49 50
The "M1", "M2", "M3" above are row names. If you want to keep them inside (as a column), you can add rownames_to_column from the same package.
x %>%
column_to_rownames("A") %>%
t %>%
as.data.frame %>%
rownames_to_column("key")
#> key a b c d e f g h i j
#> 1 M1 11 12 13 14 15 16 17 18 19 20
#> 2 M2 31 32 33 34 35 36 37 38 39 40
#> 3 M3 41 42 43 44 45 46 47 48 49 50
Essentially,
column_to_rownames("A") converts column "A" in x to row names,
t transposes the data.frame (now a matrix),
as.data.frame reclassifies it back as a data.frame (which is necessary for the next function), and
rownames_to_column("key") converts the row names into a new column called "key".
Using rownames_to_column() from the tibble package
library(magrittr)
library(tibble)
x %>%
t() %>%
as.data.frame(stringsAsFactors = FALSE) %>%
rownames_to_column() %>%
`colnames<-`(.[1,]) %>%
.[-1,] %>%
`rownames<-`(NULL)
#> A a b c d e f g h i j
#> 1 M1 11 12 13 14 15 16 17 18 19 20
#> 2 M2 31 32 33 34 35 36 37 38 39 40
#> 3 M3 41 42 43 44 45 46 47 48 49 50
x %>%
`row.names<-`(.[, 1]) %>%
t() %>%
as.data.frame(stringsAsFactors = FALSE) %>%
.[-1,]
#> a b c d e f g h i j
#> M1 11 12 13 14 15 16 17 18 19 20
#> M2 31 32 33 34 35 36 37 38 39 40
#> M3 41 42 43 44 45 46 47 48 49 50
Created on 2018-10-06 by the reprex package (v0.2.1.9000)

Filter data frame by results from tapply function

I'm trying to apply a tapply function I wrote to filter a dataset. Here is a sample data frame (df) below to describe what I'm trying to do.
For each level of df$ID, I want to keep the row whose df$Cumulative_Time is closest to 14.
ID Date Results TimeDiff Cumulative_Time
A 7/10/2015 71 0 0
A 8/1/2015 45 20 20
A 8/22/2015 0 18 38
A 9/12/2015 79 17 55
A 10/13/2015 44 26 81
A 11/27/2015 98 37 118
B 7/3/2015 75 0 0
B 7/24/2015 63 18 18
B 8/21/2015 98 24 42
B 9/26/2015 70 30 72
C 8/15/2015 77 0 0
C 9/2/2015 69 15 15
C 9/4/2015 49 2 17
C 9/8/2015 88 2 19
C 9/12/2015 41 4 23
C 9/19/2015 35 6 29
C 10/10/2015 33 18 47
C 10/14/2015 31 3 50
D 7/2/2015 83 0 0
D 7/28/2015 82 22 22
D 8/27/2015 100 26 48
D 9/17/2015 19 17 65
D 10/8/2015 30 18 83
D 12/9/2015 96 51 134
D 1/6/2016 30 20 154
D 2/17/2016 32 36 190
D 3/19/2016 42 27 217
I got as far as the following:
spec_day = 14 # value I want to compare df$Cumulative_Time to
# applying function to calculate closest value to spec_day
tapply(df$Cumulative_Time, df$ID, function(x) which(abs(x - spec_day) == min(abs(x - spec_day))))
Question: how do I include this tapply function as a means to do the filtering of my data frame df? Am I approaching this problem the right way, or is there some simpler way to accomplish this that I'm not seeing? Any help would be appreciated--thanks!
Here's a way you can do it, note that I didn't use tapply:
spec_day <- 14
new_df <- do.call('rbind',
by(df, df$ID,
FUN = function(x) x[which.min(abs(x$Cumulative_Time - spec_day)), ]
))
new_df
ID Date Results TimeDiff Cumulative_Time
A A 8/1/2015 45 20 20
B B 7/24/2015 63 18 18
C C 9/2/2015 69 15 15
D D 7/28/2015 82 22 22
which.min (and its sibling which.max) is a very useful function.
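The positions returned by the tapply call in the question are relative to each group, which is why they cannot be used to index df directly. If you prefer to stay close to that approach, a minimal sketch with ave() turns the per-group result into a flag that is as long as df. The small data frame below is a stand-in built from the first rows of groups A to C, and is_closest is an illustrative name.

```r
# Small stand-in for df (first rows of groups A, B and C from the question).
df <- data.frame(ID = c("A", "A", "A", "B", "B", "C", "C", "C"),
                 Cumulative_Time = c(0, 20, 38, 0, 18, 0, 15, 17))
spec_day <- 14

# Distance of every row to the target day.
dist <- abs(df$Cumulative_Time - spec_day)

# ave() applies the function within each ID and returns a vector as long
# as df: 1 marks the closest row of its group, 0 marks all other rows.
is_closest <- ave(dist, df$ID,
                  FUN = function(d) seq_along(d) == which.min(d))

closest <- df[is_closest == 1, ]
closest
```

Like which.min in the answer above, this keeps only the first row when two rows of a group are equally close.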
Here's a more concise and faster alternative using data.table:
library(data.table)
setDT(df)[, .SD[which.min(abs(Cumulative_Time - 14))], by = ID]
# ID Date Results TimeDiff Cumulative_Time
#1: A 8/1/2015 45 20 20
#2: B 7/24/2015 63 18 18
#3: C 9/2/2015 69 15 15
#4: D 7/28/2015 82 22 22

R: Scraping the hover text title of every cell in a table using rvest

I'm using rvest to scrape data from some javascript tables such as the one here. https://tipping.portadelaidefc.com.au/comp/the-alberton-crowd
As you can see, each cell in this table shows a value and has a second value attached that appears when you hover over it with the mouse.
I have no problem scraping the table using rvest like so:
tips <- read_html("https://tipping.portadelaidefc.com.au/comp/the-alberton-crowd") %>%
html_table(header=TRUE)
But I'm not sure how to scrape the hover values. Can I do that with rvest?
You could extend/modify the rvest:::html_table.xml_node method as follows (see the comments in the code):
my_html_table <- function(x, header = NA, trim = TRUE, fill = FALSE, dec = ".", attr_name = ""){
rows <- html_nodes(x, "tr")
n <- length(rows)
cells <- lapply(rows, "html_nodes", xpath = ".//td|.//th")
ncols <- lapply(cells, html_attr, "colspan", default = "1")
ncols <- lapply(ncols, as.integer)
p <- unique(vapply(ncols, sum, integer(1)))
if (length(p) > 1) {
if (!fill) {
stop("Table has inconsistent number of columns. ",
"Do you want fill = TRUE?", call. = FALSE)
}
else {
p <- max(p)
}
}
#############################
## The following line is the only one that was changed
#############################
values <- lapply(cells, html_attr, attr_name)
# instead of
# values <- lapply(cells, html_text, trim = trim)
out <- matrix(NA_character_, nrow = n, ncol = p)
for (i in seq_len(n)) {
row <- values[[i]]
ncol <- ncols[[i]]
col <- 1
for (j in seq_len(p)) {
if (j > length(row))
next
out[i, col] <- row[[j]]
col <- col + ncol[j]
}
}
if (is.na(header)) {
header <- all(html_name(cells[[1]]) == "th")
}
if (header) {
col_names <- out[1, , drop = FALSE]
out <- out[-1, , drop = FALSE]
} else {
col_names <- paste0("X", seq_len(ncol(out)))
}
df <- lapply(seq_len(p), function(i) {
utils::type.convert(out[, i], as.is = TRUE, dec = dec)
})
names(df) <- col_names
class(df) <- "data.frame"
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
df
}
Now you can do something like this to "merge" the original data with the tooltip data:
require(rvest)
doc <- read_html("https://tipping.portadelaidefc.com.au/comp/the-alberton-crowd")
x <- doc %>% html_node("table")
tips <- x %>% my_html_table(attr_name = "title")
dat <- doc %>% html_table() %>% `[[`(1)
tips[,c(1,2,28,29)] <- dat[,c(1,2,28,29)]
tips[1:2,] <- dat[1:2,]
which gives you
> head(tips)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21
1 Leaderboard Select Round: <NA> NA Overall 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
2 Rank Name NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 1 Gilly2311 NA <NA> 11 53 42 72 51 41 5 45 20 75 39 3 41 18 13 7 58
4 2 Harts NA <NA> 27 50 56 57 53 28 15 32 8 63 51 1 26 23 21 6 87
5 3 mygypsyrose NA <NA> 3 49 64 62 35 61 1 37 20 47 53 9 23 0 7 15 38
6 4 Scraggie_93 NA <NA> 19 58 81 32 39 31 12 54 35 44 44 17 25 15 7 16 38
X22 X23 X24 X25 X26 X27 X28 X29
1 18 19 20 21 22 23
2 NA NA NA NA NA NA Margin Score
3 74 52 11 62 80 15 888 146
4 46 63 18 51 79 4 865 145
5 48 56 19 12 78 11 748 142
6 48 53 7 21 75 11 782 142
Have a look at tips and dat to compare the results.
Or do it with a bit less effort:
library(rvest)
library(dplyr)
pg <- read_html("https://tipping.portadelaidefc.com.au/comp/the-alberton-crowd")
tips <- html_table(pg, header=TRUE)[[1]]
bind_rows(lapply(html_nodes(pg, "tbody > tr"), function(x) {
cbind.data.frame(t(c(html_text(html_nodes(x, "td"))[2],
html_attr(html_nodes(x, "td.tooltip"), "title"))))
}))
## 1 2 3 4 5 6 7 8 9 10 11 12
## (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1 Gilly2311 11 53 42 72 51 41 5 45 20 75 39
## 2 Harts 27 50 56 57 53 28 15 32 8 63 51
## 3 mygypsyrose 3 49 64 62 35 61 1 37 20 47 53
## 4 Scraggie_93 19 58 81 32 39 31 12 54 35 44 44
## 5 Deb1967 4 33 54 60 35 72 21 53 20 118 66
## 6 svolaris 6 52 45 76 50 24 19 45 19 64 58
## 7 dazza power 14 56 61 45 62 54 2 64 60 40 36
## 8 Flamingoflames 28 33 35 83 34 76 1 17 9 83 46
## 9 FEARTHEBEARD 27 34 47 59 50 59 6 50 5 54 38
## 10 Jules23 11 35 57 47 42 65 34 38 4 61 37
## .. ... ... ... ... ... ... ... ... ... ... ... ...
## Variables not shown: 13 (chr), 14 (chr), 15 (chr), 16 (chr), 17 (chr), 18 (chr), 19
## (chr), 20 (chr), 21 (chr), 22 (chr), 23 (chr), 24 (chr)
Renaming the columns, changing the column types and cbinding the result to the main table is also really straightforward.
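Both answers come down to the same step: reading the title attribute of each cell instead of its text. A minimal offline sketch, assuming rvest is installed; the HTML snippet is made up for illustration and is not the markup of the page above.

```r
library(rvest)

# Made-up table: the visible text is one value, the title attribute
# holds the value that would appear on hover.
page <- read_html('<table>
  <tr><td class="tooltip" title="53">3</td>
      <td class="tooltip" title="42">1</td></tr>
</table>')

cells <- html_nodes(page, "td.tooltip")

hover <- data.frame(shown = html_text(cells),
                    hover = html_attr(cells, "title"),
                    stringsAsFactors = FALSE)
hover
```

html_attr returns NA for cells without the attribute, so selecting only td.tooltip, as the second answer does, avoids padding the result with missing values.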

Fast creation of data.frame

Is there a way to create a data.frame faster or smarter than the one I made below?
df <- data.frame(ID = rep(c("WT", "KO"), each = 4),
Time = rep(c("A", "B", "C", "D"), times = 2),
replicate(5,sample(0:100,8,rep=TRUE)))
colnames(df)<-c("ID", "Time", c("a", "b", "c", "d", "e"))
The data.frame should still look like this
df
ID Time a b c d e
WT A 28 56 50 60 15
WT B 54 77 11 67 34
WT C 53 8 87 62 55
WT D 30 73 47 82 1
KO A 24 83 14 17 36
KO B 91 83 72 41 4
KO C 79 17 76 21 54
KO D 41 40 77 49 92
Thanks
You can just use expand.grid for the non-numeric unique combinations (sometimes you can even make use of built-in data sets such as LETTERS) and run sample only once while wrapping it up into a matrix, something like
set.seed(123)
data.frame(expand.grid(c("WT", "KO"), LETTERS[1:4]),
matrix(sample(40), ncol = 5))
# Var1 Var2 X1 X2 X3 X4 X5
# 1 WT A 12 36 6 11 24
# 2 KO A 31 15 1 27 13
# 3 WT B 16 29 8 22 25
# 4 KO B 33 14 21 28 26
# 5 WT C 34 19 32 4 20
# 6 KO C 2 38 37 35 7
# 7 WT D 18 3 40 10 5
# 8 KO D 30 23 17 9 39
For less specific cases, I would recommend looking into @TylerRinker's wakefield package, which allows you to generate random data sets easily.
Just for general information, using data.table v1.9.5+ you can now set new column names by reference using setnames. For example, if your new data set is called res, one could simply do
library(data.table) # v1.9.5+
setnames(res, c("ID", "Time", letters[1:5]))
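If you want to keep the exact layout from the question (ID varying slowest, values drawn from 0:100 with replacement), a minimal base-R sketch is to draw all numbers in one sample call and name the matrix columns up front, so no separate colnames step is needed. The name vals is only illustrative.

```r
set.seed(123)

# One call to sample() fills an 8 x 5 matrix; dimnames sets a..e directly.
vals <- matrix(sample(0:100, 8 * 5, replace = TRUE), ncol = 5,
               dimnames = list(NULL, letters[1:5]))

df <- data.frame(ID   = rep(c("WT", "KO"), each = 4),
                 Time = rep(LETTERS[1:4], times = 2),
                 vals)
df
```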

R how to take top values by group until its sum exceeds a value

I have a data frame like this:
set.seed(500)
df=data.frame(group=c(rep("A",20),rep("B",20),rep("C",20),rep("D",20)),value=round(runif(80,min=1,max=100)))
For each group I want to take the rows with the top values until their sum meets or exceeds the target value:
target=data.frame(group=c("A","B","C","D"),value=c(1000,400,500,300))
and output the new groups as 4 data frames.
I sorted them from biggest to smallest
df=df[with(df, order(group,-value)), ]
the desired output is
group value
a 98
a 93
...
a (sum from 98 to here, the group a subtotal should exceed 1000)
b 93
...
c 99
What's the best way of doing this?
Thanks.
You could also do: (Using the ordered df)
indx <- rep(target$value, table(df$group))
val1 <- with(df, ave(value, group, FUN=cumsum))
df[val1 <=indx,]
# group value
#3 A 98
#8 A 93
#12 A 89
#1 A 84
#9 A 83
#5 A 81
#13 A 77
#2 A 73
#15 A 73
#10 A 71
#18 A 62
#19 A 61
#7 A 52
#39 B 93
#28 B 90
#36 B 84
#37 B 83
#52 C 99
#59 C 96
#45 C 86
#43 C 84
#58 C 81
#65 D 93
#75 D 87
#63 D 85
Or using data.table on the ordered df
library(data.table)
setkey(setDT(df), group)
setkey(setDT(target), group)
DT1 <- df[df[target, value1:= i.value][,
cumsum(value) <value1, by=group]$V1, 1:2, with=FALSE]
Update
I guess you wanted something like this:
indx2 <- which(val1 <=indx)
indx3 <- unname(tapply(indx2,cumsum(c(TRUE,diff(indx2)!=1)), tail,1)+1)
df1 <- df[sort(c(indx2,indx3)),]
tapply(df1$value, df1$group, FUN=sum)
# A B C D
#1048 432 518 342
This splits the data frame by group and keeps rows while the running sum stays below a limit (200 here); the one-liner after it picks the last such row:
> lapply( split(df, df[[1]] ) , function(d) d[ cumsum( d[[2]]) < 200 , ] )
$A
group value
1 A 84
2 A 73
$B
group value
21 B 9
22 B 81
23 B 5
24 B 54
25 B 28
$C
group value
41 C 20
42 C 43
43 C 84
44 C 49
$D
group value
61 D 4
62 D 77
63 D 85
Then use tail
> lapply( split(df, df[[1]] ) , function(d) tail( d[ cumsum( d[[2]]) < 200 , ] ,1))
$A
group value
2 A 73
$B
group value
25 B 28
$C
group value
44 C 49
$D
group value
63 D 85
And if you want to pick the "largest values", order the data frame before doing the summation. Split on the group column of the reordered data frame (not on the original df[[1]]), otherwise rows are assigned to the wrong groups:
> dfo <- df[order(df[[2]], decreasing=TRUE), ]
> lapply( split(dfo, dfo[[1]] ) , function(d) tail( d[ cumsum( d[[2]]) < 200 , ] ,1))
$A
group value
8 A 93
$B
group value
28 B 90
$C
group value
59 C 96
$D
group value
75 D 87
If I understand correctly, you want to take the largest values from each group until the sum of those values exceeds a certain threshold. If so, I think this code will do that:
groups <- split(df, df$group)
newdfs <- Map(function(d, m) {
  d <- d[order(-d$value), ]
  d[cumsum(d$value) < m, ]
}, groups, target$value[match(names(groups), target$group)])
newdfs
This returns the data.frames in a list, which is almost certainly better than creating a bunch of new data.frames. If you wanted to merge the results into a single data.frame, you could do
do.call(rbind, newdfs)
to get
group value
A.3 A 98
A.8 A 93
A.12 A 89
A.1 A 84
A.9 A 83
A.5 A 81
A.13 A 77
A.2 A 73
A.15 A 73
A.10 A 71
A.18 A 62
A.19 A 61
A.7 A 52
B.39 B 93
B.28 B 90
B.36 B 84
B.37 B 83
C.52 C 99
C.59 C 96
C.45 C 86
C.43 C 84
C.58 C 81
D.65 D 93
D.75 D 87
D.63 D 85
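Note that the filters above keep rows only while the running sum is still below (or at) the target, so each subtotal stops short of it. If the subtotal should meet or exceed the target, as the question asks, one option is to test the running sum before each row instead. A minimal sketch on made-up data; take_until is a hypothetical helper name and the numbers are chosen so the result is easy to check by hand.

```r
# Made-up data: two groups with their own targets.
df <- data.frame(group = rep(c("A", "B"), each = 4),
                 value = c(50, 40, 30, 20, 9, 8, 7, 6))
target <- data.frame(group = c("A", "B"), value = c(100, 15))

# Hypothetical helper: keep the largest values up to and including
# the row that pushes the subtotal to the goal.
take_until <- function(d, goal) {
  d <- d[order(-d$value), ]
  before <- cumsum(d$value) - d$value  # running sum before each row
  d[before < goal, ]
}

groups <- split(df, df$group)
picked <- Map(take_until, groups,
              target$value[match(names(groups), target$group)])

# Subtotals per group: each one meets or exceeds its target.
sapply(picked, function(d) sum(d$value))
```

If a group's total never reaches its target, this keeps all of that group's rows.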
