This question already has answers here:
Find max per group and return another column
(4 answers)
Closed 5 years ago.
I have a dataframe df
df = data.frame(L = rep(letters[1:6], each = 2),
                M = rep(letters[7:12], 2),
                freq = rep(c(5, 10), 6))
L M freq
1 a g 5
2 a h 10
3 b i 5
4 b j 10
5 c k 5
6 c l 10
7 d g 5
8 d h 10
9 e i 5
10 e j 10
11 f k 5
12 f l 10
I want to select the most frequent M for each L.
In this example the output would show:
h, j, l, h, j, l
Frequency is not necessarily every second value in the actual problem.
How can I do this easily?
I've tried a tapply approach, but got stuck because tapply seems to operate on single variables and can't be used to subset a data frame by group. (It didn't get anywhere close, so I won't post the attempt.)
We can do
library(data.table)
setDT(df)[, .(M = M[which.max(freq)]), L]
# L M
#1: a h
#2: b j
#3: c l
#4: d h
#5: e j
#6: f l
Or order the 'freq' and select the first 'M' for each 'L'
setDT(df)[order(-freq), .(M = M[1]) , L]
Another solution using dplyr
library(dplyr)
df %>% group_by(L) %>% top_n(1, freq) %>% .$M
# [1] h j l h j l
If 'M' comes back as a factor, convert it to character at the end with as.character().
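For comparison, a base R sketch (no packages; the data is rebuilt here to match the example): order by descending 'freq', then keep the first row seen for each level of 'L'.

```r
# Rebuild the example data deterministically
df <- data.frame(L = rep(letters[1:6], each = 2),
                 M = rep(letters[7:12], 2),
                 freq = rep(c(5, 10), 6),
                 stringsAsFactors = FALSE)

# Order by descending freq, then the first occurrence of each L
# is its most frequent row; re-sort by L for readability
res <- df[order(-df$freq), ]
res <- res[!duplicated(res$L), ]
res <- res[order(res$L), ]
res$M
# [1] "h" "j" "l" "h" "j" "l"
```

This mirrors the `order(-freq)` data.table idea above without any dependencies; ties are broken by original row order, same as `which.max`.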
I want to add a total row (as in the Excel tables) while writing my data.frame in a worksheet.
Here is my present code (using openxlsx):
writeDataTable(wb = WB, sheet = "Data", x = X, withFilter = FALSE, bandedRows = FALSE, firstColumn = TRUE)
X contains a data.frame with 8 character variables and 1 numeric variable, so the total row should only contain a total for the numeric column. (Ideally I could enable Excel's built-in total-row feature while writing the table to the workbook object, the way I did with firstColumn, rather than appending a total row manually.)
I searched for a solution both on Stack Overflow and in the official openxlsx documentation, but to no avail. Please suggest solutions using openxlsx.
EDIT:
Adding data sample:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
After Total row:
A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f
na na na na na na na 22 na
library(janitor)
adorn_totals(df, "row")
#> A B C D E F G H I
#> a b s r t i s 5 j
#> f d t y d r s 9 s
#> w s y s u c k 8 f
#> Total - - - - - - 22 -
If you prefer empty space instead of - in the character columns you can specify fill = "" or fill = NA.
Assuming your data is stored in a data.frame called df:
df <- read.table(text =
"A B C D E F G H I
a b s r t i s 5 j
f d t y d r s 9 s
w s y s u c k 8 f",
header = TRUE,
stringsAsFactors = FALSE)
You can create a row using lapply
totals <- lapply(df, function(col) {
  if (is.numeric(col)) sum(col) else NA
})
and add it to df using rbind()
df <- rbind(df, totals)
head(df)
A B C D E F G H I
1 a b s r t i s 5 j
2 f d t y d r s 9 s
3 w s y s u c k 8 f
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> 22 <NA>
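Tying this back to the openxlsx call from the question: as far as I can tell writeDataTable has no switch for Excel's built-in totals-row feature, so appending the row before writing is the workaround. A sketch (assuming openxlsx is installed; WB and X mirror the question's objects, with a trimmed three-column stand-in for X):

```r
library(openxlsx)  # assumed installed

# Trimmed stand-in for the question's X
X <- data.frame(A = c("a", "f", "w"), B = c("b", "d", "s"),
                H = c(5, 9, 8), stringsAsFactors = FALSE)

# Append the total row (sum numeric columns, NA elsewhere) before writing
totals <- lapply(X, function(col) if (is.numeric(col)) sum(col) else NA)
X <- rbind(X, totals)

WB <- createWorkbook()
addWorksheet(WB, "Data")
writeDataTable(wb = WB, sheet = "Data", x = X,
               withFilter = FALSE, bandedRows = FALSE, firstColumn = TRUE)
out <- tempfile(fileext = ".xlsx")
saveWorkbook(WB, out)
```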
I would like to split a data.frame into a list based on row values/characters across all columns of the data.frame.
I wrote lists of data.frames to file using write.list {erer}
So now when I read them in again, they look like this:
dummy data
set.seed(1)
df <- cbind(data.frame(col1=c(sample(LETTERS, 4),"col1",sample(LETTERS, 7))),
data.frame(col2=c(sample(LETTERS, 4),"col2",sample(LETTERS, 7))),
data.frame(col3=c(sample(LETTERS, 4),"col3",sample(LETTERS, 7))))
col1 col2 col3
1 G E Q
2 J R D
3 N J G
4 U Y I
5 col1 col2 col3
6 F M A
7 W R J
8 Y X U
9 P I H
10 N Y K
11 B T M
12 E E Y
And I would like to split into lists by c("col1","col2","col3") producing
[[1]]
col1 col2 col3
1 G E Q
2 J R D
3 N J G
4 U Y I
[[2]]
col1 col2 col3
1 F M A
2 W R J
3 Y X U
4 P I H
5 N Y K
6 B T M
7 E E Y
Feels like it should be straightforward using split, but my attempts so far have failed. Also, as you can see, the chunks aren't a fixed number of rows, so I can't split on a regular row interval.
Any pointers would be highly appreciated, thanks!
Try
lapply(split(df, cumsum(grepl(names(df)[1], df$col1))),
       function(x) x[!grepl(names(df)[1], x$col1), ])
#$`0`
# col1 col2 col3
#1 G E Q
#2 J R D
#3 N J G
#4 U Y I
#$`1`
# col1 col2 col3
#6 F M A
#7 W R J
#8 Y X U
#9 P I H
#10 N Y K
#11 B T M
#12 E E Y
This should be general if you want to split whenever a row exactly matches the column names:
dfSplit <- split(df, cumsum(Reduce("&", Map("==", df, colnames(df)))))
for (i in 2:length(dfSplit)) dfSplit[[i]] <- dfSplit[[i]][-1, ]
The second line can be written a little more R-style as #DavidArenburg suggested in the comments.
dfSplit[-1] <- lapply(dfSplit[-1], function(x) x[-1, ])
It also has the added benefit of doing nothing if dfSplit has length 1 (unlike my original second line, which would throw an error in that case).
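Putting the general approach together on the dummy data, end to end (base R only; the chunk sizes are deterministic because the embedded header always sits at row 5):

```r
set.seed(1)
df <- data.frame(col1 = c(sample(LETTERS, 4), "col1", sample(LETTERS, 7)),
                 col2 = c(sample(LETTERS, 4), "col2", sample(LETTERS, 7)),
                 col3 = c(sample(LETTERS, 4), "col3", sample(LETTERS, 7)),
                 stringsAsFactors = FALSE)

# TRUE where a row is an exact repeat of the header
header_row <- Reduce(`&`, Map(`==`, df, colnames(df)))
dfSplit <- split(df, cumsum(header_row))

# Drop the repeated header row from every chunk after the first
dfSplit[-1] <- lapply(dfSplit[-1], function(x) x[-1, ])

vapply(dfSplit, nrow, integer(1))
# 0 1
# 4 7
```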
I have a data frame that has one value in each cell, but my last column is a list.
Example. Here there are 3 columns. X and Y columns have one value in each row. But column Z is actually a list. It can have multiple values in each cell.
X Y Z
1 a d h, i, j
2 b e j, k
3 c f l, m, n, o
I need to create this:
  X Y Z
1 a d h
2 a d i
3 a d j
4 b e j
5 b e k
6 c f l
7 c f m
8 c f n
9 c f o
Can someone help me figure this out ? I am not sure how to use melt or dcast or any other function for this.
Thanks.
unnest from tidyr works (assuming your data frame is called dat):
library(tidyr)
unnest(dat, Z)
# in tidyr >= 1.0 the column is named explicitly: unnest(dat, cols = Z)
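If you'd rather avoid tidyr, the same expansion works in base R (a sketch; dat is rebuilt here to match the example): repeat each row once per element of its Z entry, then flatten Z.

```r
dat <- data.frame(X = c("a", "b", "c"), Y = c("d", "e", "f"),
                  stringsAsFactors = FALSE)
dat$Z <- list(c("h", "i", "j"), c("j", "k"), c("l", "m", "n", "o"))

# Repeat each row by the length of its Z entry, then unlist Z alongside
out <- data.frame(dat[rep(seq_len(nrow(dat)), lengths(dat$Z)),
                      c("X", "Y")],
                  Z = unlist(dat$Z),
                  row.names = NULL, stringsAsFactors = FALSE)
out
```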
I have a data frame. I want to filter out some issues only in the case they are associated with a specific group.
For a dummy example, suppose I have the following:
> mydf
Group Issue
1 A G
2 A H
3 A L
4 B V
5 B M
6 C G
7 C H
8 C L
9 C X
10 D G
11 D H
12 D I
I want to filter out rows with a "G" or "H" or "L" issue if there is also an "L" issue in that Group.
So in this case, I want to filter out rows 1, 2, 3, 6, 7, 8 but leave rows 4, 5, 9, 10, 11, and 12. Thus the result would be:
> mydf
Group Issue
4 B V
5 B M
9 C X
10 D G
11 D H
12 D I
I think I first need to group_by(Group) but then I'm wondering what's the best way to do this.
Thanks!
If the rule is
When a group contains L, drop L, G & H.
then
library(dplyr)
mydf %>%
  group_by(Group) %>%
  filter(if (any(Issue == "L")) !(Issue %in% c("G", "H", "L")) else TRUE)
# Group Issue
# 1 B V
# 2 B M
# 3 C X
# 4 D G
# 5 D H
# 6 D I
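The same rule in base R, for reference (a sketch using ave() to flag, per row, whether its Group contains an "L" issue):

```r
mydf <- data.frame(Group = rep(c("A", "B", "C", "D"), c(3, 2, 4, 3)),
                   Issue = c("G", "H", "L", "V", "M",
                             "G", "H", "L", "X", "G", "H", "I"),
                   stringsAsFactors = FALSE)

# TRUE for every row whose Group contains at least one "L" issue
group_has_L <- ave(mydf$Issue == "L", mydf$Group, FUN = any)

# Drop G/H/L rows only inside those flagged groups
res <- mydf[!(group_has_L & mydf$Issue %in% c("G", "H", "L")), ]
res$Issue
# [1] "V" "M" "X" "G" "H" "I"
```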
Given:
df <- data.frame(rep = letters[sample(4, 30, replace = TRUE)],
                 loc = LETTERS[sample(5:8, 30, replace = TRUE)],
                 y = rnorm(30))
lookup <- data.frame(rep=letters[1:4], loc=LETTERS[5:8])
This will give me the rows in df that have rep,loc combinations that occur in lookup:
library(plyr)
mdply(lookup, function(rep, loc) {
  r <- rep
  l <- loc
  subset(df, rep == r & loc == l)
})
But I've read that using subset() inside a function is poor practice due to scoping issues. So how do I get the desired result using index notation?
In this particular case, merge seems to make the most sense to me:
merge(df, lookup)
# rep loc y
# 1 a E 1.6612394
# 2 a E 1.1050825
# 3 a E -0.7016759
# 4 b F 0.4364568
# 5 d H 1.3246636
# 6 d H -2.2573545
# 7 d H 0.5061980
# 8 d H 0.1397326
A simple alternative might be to paste together the "rep" and "loc" columns from df and from lookup and subset based on that:
df[do.call(paste, df[c("rep", "loc")]) %in% do.call(paste, lookup), ]
# rep loc y
# 4 d H 1.3246636
# 10 b F 0.4364568
# 14 a E -0.7016759
# 15 a E 1.6612394
# 19 d H 0.5061980
# 20 a E 1.1050825
# 22 d H -2.2573545
# 28 d H 0.1397326
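As a quick sanity check that the two approaches select the same rows (a sketch; the seed is mine, since the question didn't fix one, so the exact y values will differ from the output above):

```r
set.seed(42)  # hypothetical seed for reproducibility
df <- data.frame(rep = letters[sample(4, 30, replace = TRUE)],
                 loc = LETTERS[sample(5:8, 30, replace = TRUE)],
                 y = rnorm(30), stringsAsFactors = FALSE)
lookup <- data.frame(rep = letters[1:4], loc = LETTERS[5:8],
                     stringsAsFactors = FALSE)

via_merge <- merge(df, lookup)                      # reorders, resets row names
via_paste <- df[do.call(paste, df[c("rep", "loc")]) %in%
                  do.call(paste, lookup), ]         # keeps original row order

# Same set of rows either way, just ordered differently
identical(sort(via_merge$y), sort(via_paste$y))
# [1] TRUE
```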