Finding maximum value across multiple columns (pmax gives NAs) - r

Using reshape function, I have data that looks something like this:
df1
ID Score1 Score2 Score3
1 2 3 1
2 3 2 1
3 2 1 NA
4 1 NA NA
As you can see, some of my score variables have missing values.
I'm interested in finding the maximum score variable for all ID values. When I tried using pmax(df1$Score1,df1$Score2,df1$Score3), my resulting vector contains NAs. I'm not sure why this is, as I know that my Score1 variable doesn't contain any NAs.
This is what I'd like my output to accomplish:
ID MaxScore
1 3
2 3
3 2
4 1
Thanks

You could use apply on each row (MARGIN = 1)
apply(X = df1[,-1], MARGIN = 1, FUN = max, na.rm = TRUE)
#[1] 3 3 2 1
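One caveat with this approach: when a row contains only NAs, max(..., na.rm = TRUE) returns -Inf with a warning. A minimal sketch of turning those values back into NA, using a made-up two-row data frame (scores is hypothetical, not the asker's data):

```r
# Hypothetical data: the second row has no non-missing score
scores <- data.frame(Score1 = c(2, NA), Score2 = c(3, NA))

# max() on an all-NA row with na.rm = TRUE yields -Inf (and a warning)
row_max <- suppressWarnings(apply(scores, 1, max, na.rm = TRUE))

# Convert the -Inf placeholders back to NA
row_max[is.infinite(row_max)] <- NA
row_max
```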

We can use the vectorized pmax to get this done
cbind(df1['ID'], MaxScore = do.call(pmax, c(df1[-1], na.rm = TRUE)))
# ID MaxScore
#1 1 3
#2 2 3
#3 3 2
#4 4 1
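As for why the original call returned NAs: by default pmax returns NA at any position where at least one of its arguments is NA, even if the other arguments are complete. A short sketch on the question's data (rebuilt here by hand) showing both behaviours:

```r
# Data from the question, typed in by hand
df1 <- data.frame(ID = 1:4,
                  Score1 = c(2, 3, 2, 1),
                  Score2 = c(3, 2, 1, NA),
                  Score3 = c(1, 1, NA, NA))

# Default: an NA in any argument makes that position NA
default_max <- pmax(df1$Score1, df1$Score2, df1$Score3)
default_max
# 3 3 NA NA

# With na.rm = TRUE, NAs are skipped position by position
clean_max <- pmax(df1$Score1, df1$Score2, df1$Score3, na.rm = TRUE)
clean_max
# 3 3 2 1
```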

Related

How to sort a dataframe in decreasing order lapply and sort in r

Not sure if this is a duplicate but I couldn't find anything that either solves my original problem or the issue I'm running into with the partial I did find.
The goal is to sort a dataframe independently by column.
Reproducible example
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a
name date1 date2 date3
1 a 2 0 0
2 a 3 2 2
3 a 1 3 0
4 b 3 1 3
5 b 1 2 2
6 b 2 0 1
library(plyr)
b <- ddply(a, "name", function(x) { as.data.frame(lapply(x, sort)) })
b
name date1 date2 date3
1 a 1 0 0
2 a 2 2 0
3 a 3 3 2
4 b 1 0 1
5 b 2 1 2
6 b 3 2 3
Now this works as expected, but is the opposite of what I'm looking to do.
Desired output
b
name date1 date2 date3
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
I've tried to add the decreasing=T parameter but haven't had any luck with the variations I've tried, and I usually end up with an error about missing arguments or undefined columns being selected. How does one correctly implement a decreasing sort with this syntax, or otherwise achieve the end result without explicitly naming the columns (their names are dates, so they change often)?
Bonus
How could this code be adapted to account for NAs with na.last?
Thank you!
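I think your code scrambles the rows of the data.frame: each column is sorted independently, so values no longer line up across a row, which is not good practice. The standard dplyr approach is to use the arrange() function, like this: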
library(tidyverse)
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a %>%
arrange(name,-date1)
If you want to live dangerously, here is the code for it:
a %>%
group_by(name) %>%
mutate_all(sort,decreasing = TRUE)
name date1 date2 date3
<fct> <dbl> <dbl> <dbl>
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
A solution with the data.table package is the following
library(data.table)
a <- data.table(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# alternatively:
# a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# setDT(a)
b <- a[, lapply(.SD, sort, decreasing = TRUE), by = name]
.SD returns the subset of data, in this case created with the by = name. It splits the original data.table by the values in the given column.
This also fulfills your bonus requirement, the na.last can be supplied.
aa <- data.table(name = c("a","a","a","b","b","b"),date1 = c(NA,3,1,3,1,NA),date2 = c(0,2,NA,1,2,0),date3 = c(0,2,0,3,2,NA))
bb <- aa[, lapply(.SD, sort, decreasing = TRUE, na.last = TRUE), by = name]
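For readers who prefer to avoid extra packages, here is a hedged base R sketch of the same idea. The helper name sort_within_groups is made up for illustration; it relies on ave to apply sort within each group and write the values back in place. Note that sort drops NAs by default (na.last = NA), which would change the vector length, so na.last = TRUE is passed explicitly.

```r
# Hypothetical helper: sort every non-group column in decreasing order
# within each level of group_col, keeping NAs at the end of each group
sort_within_groups <- function(data, group_col) {
  value_cols <- setdiff(names(data), group_col)
  for (col in value_cols) {
    data[[col]] <- ave(data[[col]], data[[group_col]],
                       FUN = function(v) sort(v, decreasing = TRUE, na.last = TRUE))
  }
  data
}

a <- data.frame(name = c("a","a","a","b","b","b"),
                date1 = c(2,3,1,3,1,2),
                date2 = c(0,2,3,1,2,0),
                date3 = c(0,2,0,3,2,1))
b <- sort_within_groups(a, "name")

# Same data with some NAs, as in the bonus question
aa <- data.frame(name = c("a","a","a","b","b","b"),
                 date1 = c(NA,3,1,3,1,NA),
                 date2 = c(0,2,NA,1,2,0),
                 date3 = c(0,2,0,3,2,NA))
bb <- sort_within_groups(aa, "name")
```

As with the other answers, each column is sorted independently, so values in a row no longer belong together after the call.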

How to apply function to colnames in tidyverse

Just like in the title: is there any function that allows applying another function to the column names of a data frame? I mean something like forcats::fct_relabel, which applies a function to factor labels.
To give an example, suppose I have a data.frame as below:
X<-data.frame(
firababst = c(1,1,1),
secababond = c(2,2,2),
thiababrd = c(3,3,3)
)
X
firababst secababond thiababrd
1 1 2 3
2 1 2 3
3 1 2 3
Now I wish to get rid of abab from column names by applying stringr::str_remove. My workaround involves magrittr::set_colnames:
X %>%
set_colnames(colnames(.) %>% str_remove('abab'))
first second third
1 1 2 3
2 1 2 3
3 1 2 3
Can you suggest a more straightforward way? Ideally, something like:
X %>%
magic_foo(str_remove, 'abab')
You can do:
X %>%
rename_all(~ str_remove(., "abab"))
first second third
1 1 2 3
2 1 2 3
3 1 2 3
With base R, we can do
names(X) <- sub("abab", "", names(X))
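If a reusable function in the spirit of the asker's magic_foo is wanted, a minimal base R sketch could look like the following. The name rename_cols is hypothetical; it simply applies any function to the column names and returns the data frame. (In dplyr 1.0.0 and later, rename_with plays this role and supersedes rename_all.)

```r
# Hypothetical helper: apply fn (plus extra arguments) to the column names
rename_cols <- function(data, fn, ...) {
  names(data) <- fn(names(data), ...)
  data
}

X <- data.frame(firababst = c(1, 1, 1),
                secababond = c(2, 2, 2),
                thiababrd = c(3, 3, 3))

# pattern and replacement are matched by name, so the names vector becomes x in sub()
Y <- rename_cols(X, sub, pattern = "abab", replacement = "")
names(Y)
# "first" "second" "third"
```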

Calculate rowwise maximum from columns that have changing names

I have the following objects:
s1 = "1_1_1_1_1"
s2 = "2_1_1_1_1"
s3 = "3_1_1_1_1"
Please note that the value of s1, s2, s3 can change in another example.
I then have the following data frame:
set.seed(666)
df = data.frame(draw = c(1,2,3,4,1,2,3,4,1,2,3,4),
resp = c(1,1,1,1,2,2,2,2,3,3,3,3),
"1_1_1_1_1" = runif(12),
"2_1_1_1_1" = runif(12),
"3_1_1_1_1" = runif(12))
Please note that the column names of my data frame will change based on the values of s1, s2, s3.
I now want to achieve the following:
1. I want to find out which of the last three columns in df has the highest value and store it as a value in a new column (values are supposed to be 1, 2 or 3, depending on whether the highest value is in the first, second or third of these variables).
2. Now that I know which value is the highest per row, I want to group/summarize the result by the column resp and count how often my max value is 1, 2 or 3.
So the outcome from 1. should be:
draw resp 1_1_1_1_1 2_1_1_1_1 3_1_1_1_1 max
1 1 0.774 0.095 0.806 3
2 1 0.197 0.142 0.266 3
...
And the outcome from 2. is supposed to be:
resp first_max second_max third_max
1 1 1 2
2 2 1 1
3 1 2 1
My problem is that tidyverse's rowwise function is deprecated and that I don't know how to dynamically address columns in a tidyverse pipe by column names which are stored externally (here in s1, s2, s3). One last note: I might be overcomplicating things by trying to go by the column names when, in fact, the columns that I'm interested in are always at positions 3:5.
Here is one way to get what you want. For a slightly different format, you can use count rather than table, but this matches your expected output. Hope this helps!!
library(dplyr)
df %>%
mutate(max_val = max.col(select(., starts_with("X")))) %>%
select(resp, max_val) %>%
table()
max_val
resp 1 2 3
1 1 1 2
2 2 1 1
3 1 2 1
Or, you could do this:
df %>%
mutate(max_val = max.col(.[3:5])) %>%
count(resp, max_val) %>%
mutate(max_val = paste0("max_", max_val)) %>%
spread(value = n, key = max_val)
resp max_1 max_2 max_3
<dbl> <int> <int> <int>
1 1 1 1 2
2 2 2 1 1
3 3 1 2 1
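Calculate the max using pmap (row-wise iteration):
library(purrr)
library(tibble)
max_cols <- pmap_dbl(unname(df), function(x, y, ...) {
  vals <- unlist(list(...))
  # which.max returns a single index (the first maximum), so a tie cannot break pmap_dbl
  which.max(vals)
})
result <- df %>% add_column(max = max_cols)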
> result
draw resp X1_1_1_1_1 X2_1_1_1_1 X3_1_1_1_1 max
1 1 1 0.4551478 0.70061232 0.618439890 2
2 2 1 0.3667764 0.26670969 0.024742605 1
3 3 1 0.6806912 0.03233215 0.004014758 1
4 4 1 0.9117449 0.42926492 0.885247456 1
5 1 2 0.1886954 0.34189707 0.985054492 3
6 2 2 0.5569398 0.78043504 0.100714130 2
7 3 2 0.9791164 0.92823982 0.676584495 1
8 4 2 0.9174654 0.74627116 0.485582287 1
9 1 3 0.3681890 0.69622331 0.672346875 2
10 2 3 0.5510356 0.99651637 0.482430518 2
11 3 3 0.4283281 0.12832611 0.018095649 1
12 4 3 0.6168436 0.64381995 0.655178701 3
Reshape the data frame.
reshape2::dcast(result,resp~max,fun.aggregate = length,value.var = "max")
resp 1 2 3
1 1 1 1 2
2 2 2 1 1
3 3 1 2 1
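On the question of addressing the columns through the names stored in s1, s2 and s3: a hedged base R sketch is shown below. It assumes the data frame is built so that the column names are kept exactly as stored (here by assigning names afterwards), and it passes ties.method = "first" because max.col breaks ties at random by default. The variable names score_cols and counts are illustrative only.

```r
s1 <- "1_1_1_1_1"
s2 <- "2_1_1_1_1"
s3 <- "3_1_1_1_1"

set.seed(666)
df <- data.frame(draw = rep(1:4, times = 3),
                 resp = rep(1:3, each = 4),
                 runif(12), runif(12), runif(12))
# Assign the externally stored names so they are not mangled to X1_1_1_1_1 etc.
names(df)[3:5] <- c(s1, s2, s3)

# Select the score columns by the stored names, whatever they happen to be
score_cols <- c(s1, s2, s3)
df$max <- max.col(as.matrix(df[score_cols]), ties.method = "first")

# Count how often each column position holds the row maximum, per resp
counts <- table(resp = df$resp, max = factor(df$max, levels = 1:3))
counts
```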

Summary of values across rows and columns in R

I have a dataset that looks like:
Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4
I want the summary of this dataset across rows and columns for the count of each value as below:
Group 4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
The count of 4 in the dataset for group XYZ across all rows and columns is 1, for 2 and 1 its 2, for 3 its 1. I can do this by creating 4 new columns 4,3,2,1 and getting the count row wise and then column wise, but this is not efficient and scalable. I am sure there is a better way to get this done.
Using reshape2 package we can melt and dcast as follows,
library(reshape2)
dcast(na.omit(melt(df, id.vars = 'Group')), Group ~ value, fun.aggregate = length)
# Group 1 2 3 4
#1 DEF 3 1 3 1
#2 PQR 2 2 1 1
#3 XYZ 2 2 1 1
This uses no packages and is just one line. Here DF$Group[row(DF[-1])] is a Group labels vector such that each element corresponds to the unravelled numeric vector unlist(DF[-1]).
table(DF$Group[row(DF[-1])], unlist(DF[-1]))
giving:
1 2 3 4
DEF 3 1 3 1
PQR 2 2 1 1
XYZ 2 2 1 1
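To make the one-liner less opaque, the following sketch spells out its two ingredients. row(DF[-1]) holds the row number of every cell, and both it and unlist(DF[-1]) run through the cells column by column, so the two vectors line up element by element. The intermediate names group_per_cell and values are introduced here only for illustration.

```r
Lines <- "Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4"
DF <- read.table(text = Lines, header = TRUE, na.strings = "Na")

# One Group label per cell of the numeric columns (column-major order)
group_per_cell <- DF$Group[row(DF[-1])]

# The cell values themselves, in the same column-major order
values <- unlist(DF[-1], use.names = FALSE)

# Both vectors have one entry per cell: 6 rows x 4 columns
length(group_per_cell) == length(values)

# table() drops the NA values by default and counts the rest
tab <- table(group_per_cell, values)
tab
```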
If the order of rows and columns shown in the question is important, then we can create factors from each of the two table arguments, with the factor levels defined in the desired order. In this case we use the following line instead of the line of code above:
table(Group = factor(DF$Group[row(DF[-1])], unique(DF$Group)), factor(unlist(DF[-1]), 4:1))
giving:
Group 4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
The above produces an object of class "table". This is a particularly suitable class for tabulated frequencies. For example, once in this form ftable can be used to easily rearrange it further as in ftable(tab, row.vars = 2) or ftable(tab, row.vars = 1:2) where tab is the above computed table.
If a data.frame were preferred then convert it like this:
cbind(Group = rownames(tab), as.data.frame.matrix(tab))
The input data.frame DF is defined reproducibly in Note 2 at the end.
Alternatives
Although the above seems the most direct here are some other alternatives that also use no packages:
1) by For each set of rows having the same Group value the anonymous function creates a data.frame identifying the Group, converting the columns other than the first to a factor with the indicated levels and running table to get the counts. The "by" list that is returned is sorted back to the original order and we rbind everything back together.
do.call("rbind",
by(DF, DF$Group, function(x) {
data.frame(Group = x[1,1],
as.list(table(factor(unlist(x[, -1]), levels = 4:1))),
check.names = FALSE)
})[unique(DF$Group)])
giving:
Group 4 3 2 1
XYZ XYZ 1 1 2 2
DEF DEF 1 3 1 3
PQR PQR 1 1 2 2
1a) This slightly shorter variation would also work. It returns a matrix identifying the groups using row names.
kount <- function(x) table(factor(unlist(x), levels = 4:1))
m <- do.call("rbind", by(DF[, -1], DF$Group, kount)[unique(DF$Group)])
giving:
> m
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
2) outer
gps <- unique(DF$Group)
levs <- 4:1
kount2 <- function(g, lv) sum(subset(DF, Group == g)[-1] == lv, na.rm = TRUE)
m <- outer(gps, levs, Vectorize(kount2))
dimnames(m) <- list(gps, levs)
giving this matrix:
> m
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
3) sapply
kount3 <- function(g) table(factor(unlist(DF[DF$Group == g, -1]), levels = 4:1))
gps <- as.character(unique(DF$Group))
do.call("rbind", sapply(gps, kount3, simplify = FALSE))
giving:
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
4) aggregate
aggregate(1:nrow(DF), DF["Group"], function(ix)
table(factor(unlist(DF[ix, -1]), levels = 4:1)))[unique(DF$Group), ]
giving:
Group x.4 x.3 x.2 x.1
3 XYZ 1 1 2 2
1 DEF 1 3 1 3
2 PQR 1 1 2 2
5) tapply
do.call("rbind", tapply(1:nrow(DF), DF$Group, function(ix)
table(factor(unlist(DF[ix, -1]), levels = 4:1))))[unique(DF$Group), ]
6) reshape
with(reshape(DF, dir = "long", varying = list(2:5)),
table(factor(Group, unique(DF$Group)), factor(A, 4:1)))
giving:
4 3 2 1
XYZ 1 1 2 2
DEF 1 3 1 3
PQR 1 1 2 2
Note 1: (1a), (2), (3), (5) and (6) produce a matrix or table result with groups as row names. If you prefer a data frame with Groups as a column then supposing that m is the matrix, add this:
data.frame(Group = rownames(m), m, check.names = FALSE)
Note 2: The input DF in reproducible form is:
Lines <- "Group A B C D
XYZ 4 Na 1 3
XYZ Na 2 2 1
DEF 4 3 2 1
DEF 3 3 1 1
PQR 1 Na Na 1
PQR 3 2 2 4"
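# stringsAsFactors = TRUE reproduces the pre-4.0 default; alternative (4) indexes rows by the factor codes of Group
DF <- read.table(text = Lines, header = TRUE, na.strings = "Na", stringsAsFactors = TRUE)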
We can use dplyr/tidyr
library(dplyr)
library(tidyr)
df1 %>%
mutate_each(funs(replace(., .=="Na", NA))) %>%
gather(Var, Val, A:D, na.rm=TRUE) %>%
group_by(Group, Val) %>%
tally() %>%
spread(Val, n)
# Group `1` `2` `3` `4`
#* <chr> <int> <int> <int> <int>
#1 DEF 3 1 3 1
#2 PQR 2 2 1 1
#3 XYZ 2 2 1 1

Count occurrences of value in a set of variables in R (per row)

Let's say I have a data frame with 10 numeric variables V1-V10 (columns) and multiple rows (cases).
What I would like R to do is: For each case, give me the number of occurrences of a certain value in a set of variables.
For example the number of occurrences of the numeric value 99 in that single row for V2, V3, V6, which obviously has a minimum of 0 (none of the three have the value 99) and a maximum of 3 (all of the three have the value 99).
I am really looking for an equivalent to the SPSS function COUNT: "COUNT creates a numeric variable that, for each case, counts the occurrences of the same value (or list of values) across a list of variables."
I thought about table() and library plyr's count(), but I cannot really figure it out. Vectorized computation preferred. Thanks a lot!
If you need to count a particular word or letter in each row:
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,"L"),V2=c(1,"L",2,2,"L"),
V3=c(1,2,2,1,"L"), V4=c("L", "L", 1,2, "L"))
To count the number of "L" values in each row, just use
#This is how to compute a new variable counting occurrences of "L" in V1-V4.
df$count.L <- apply(df, 1, function(x) length(which(x=="L")))
The result will look like this
> df
V1 V2 V3 V4 count.L
1 1 1 1 L 1
2 1 L 2 L 2
3 2 2 2 1 0
4 1 2 1 2 0
5 L L L L 4
I think that there ought to be a simpler way to do this, but the best way that I can think of to get a table of counts is to loop (implicitly using sapply) over the unique values in the dataframe.
#Some example data
df <- data.frame(a=c(1,1,2,2,3,9),b=c(1,2,3,2,3,1))
df
# a b
#1 1 1
#2 1 2
#3 2 3
#4 2 2
#5 3 3
#6 9 1
levels=unique(do.call(c,df)) #all unique values in df
out <- sapply(levels,function(x)rowSums(df==x)) #count occurrences of x in each row
colnames(out) <- levels
out
# 1 2 3 9
#[1,] 2 0 0 0
#[2,] 1 1 0 0
#[3,] 0 1 1 0
#[4,] 0 2 0 0
#[5,] 0 0 2 0
#[6,] 1 0 0 1
Try
apply(df,MARGIN=1,table)
Here df is your data.frame. This returns a list with one element per row of the data.frame, in the same order. Each element is a table whose names are the values found in that row and whose entries are the number of times each value occurs.
For instance:
df=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))
#create a data.frame containing some data
df #show the data.frame
V1 V2 V3
1 10 20 20
2 20 30 10
3 10 20 20
4 20 30 10
apply(df,MARGIN=1,table) #apply the function table on each row (MARGIN=1)
[[1]]
10 20
1 2
[[2]]
10 20 30
1 1 1
[[3]]
10 20
1 2
[[4]]
10 20 30
1 1 1
#desired result
Here is another straightforward solution that comes closest to what the COUNT command in SPSS does — creating a new variable that, for each case (i.e., row) counts the occurrences of a given value or list of values across a list of variables.
#Let df be a data frame with four variables (V1-V4)
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
#This is how to compute a new variable counting occurrences of value "1" in V1-V4.
df$count.1 <- apply(df, 1, function(x) length(which(x==1)))
The updated data frame contains the new variable count.1 exactly as the SPSS COUNT command would do.
> df
V1 V2 V3 V4 count.1
1 1 1 1 NA 3
2 1 NA 2 NA 1
3 2 2 2 1 1
4 1 2 1 2 2
5 NA NA NA NA 0
You can do the same to count how many times the value "2" occurs per row in V1-V4. Note that you need to select the columns (variables) in df to which the function is applied, because df now also contains count.1.
df$count.2 <- apply(df[1:4], 1, function(x) length(which(x==2)))
You can also apply a similar logic to count the number of missing values in V1-V4.
df$count.na <- apply(df[1:4], 1, function(x) sum(is.na(x)))
The final result should be exactly what you wanted:
> df
V1 V2 V3 V4 count.1 count.2 count.na
1 1 1 1 NA 3 0 1
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 0
4 1 2 1 2 2 2 0
5 NA NA NA NA 0 0 4
This solution can easily be generalized to a range of values.
Suppose we want to count how many times a value of 1 or 2 occurs in V1-V4 per row:
df$count.1or2 <- apply(df[1:4], 1, function(x) sum(x %in% c(1,2)))
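Since the question asked for a vectorized computation, the same counts can also be obtained without apply by comparing the selected columns to the value and summing the resulting logical matrix with rowSums. A minimal sketch on the same example data (the names cols, count_1, count_na and count_1or2 are illustrative):

```r
df <- data.frame(V1 = c(1, 1, 2, 1, NA), V2 = c(1, NA, 2, 2, NA),
                 V3 = c(1, 2, 2, 1, NA), V4 = c(NA, NA, 1, 2, NA))

# The set of variables to inspect
cols <- c("V1", "V2", "V3", "V4")

# Occurrences of the value 1 per row; NA comparisons are dropped by na.rm
count_1 <- rowSums(df[cols] == 1, na.rm = TRUE)

# Number of missing values per row
count_na <- rowSums(is.na(df[cols]))

# Occurrences of 1 or 2 per row; %in% never returns NA, so na.rm is not needed
count_1or2 <- rowSums(sapply(df[cols], `%in%`, c(1, 2)))
```

For the asker's original case, the same pattern would be rowSums(df[c("V2", "V3", "V6")] == 99, na.rm = TRUE) on a data frame that has those columns.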
A solution with functions from the dplyr package would be the following:
Using the example data set from LechAttack's answer:
df <- data.frame(V1=c(1,1,2,1,NA),V2=c(1,NA,2,2,NA),
V3=c(1,2,2,1,NA), V4=c(NA, NA, 1,2, NA))
Count the appearances of "1" and "2" each and both combined:
df %>%
rowwise() %>%
mutate(count_1 = sum(c_across(V1:V4) == 1, na.rm = TRUE),
count_2 = sum(c_across(V1:V4) == 2, na.rm = TRUE),
count_12 = sum(c_across(V1:V4) %in% 1:2, na.rm = TRUE)) %>%
ungroup()
which gives the table:
V1 V2 V3 V4 count_1 count_2 count_12
1 1 1 1 NA 3 0 3
2 1 NA 2 NA 1 1 2
3 2 2 2 1 1 3 4
4 1 2 1 2 2 2 4
5 NA NA NA NA 0 0 0
My attempt at something similar to SPSS's COUNT in R is as follows. Note that rowSums here adds up the values in each row; to count occurrences of a specific value instead, compare the selected columns to that value inside rowSums.
df <- data.frame(a=c(1,1,NA,2,3,9),b=c(1,2,3,2,NA,1)) #Dummy data with NAs
df %>%
dplyr::mutate(count = rowSums( #sum across rows
dplyr::select(., #select columns from the piped data frame (.)
dplyr::one_of( #within select, one_of names the columns you want
c('a','b'))), na.rm = TRUE)) #once the columns are specified, that's all you need; na.rm is the cherry on top
This is what the output looks like
>df
a b count
1 1 1 2
2 1 2 3
3 NA 3 3
4 2 2 4
5 3 NA 3
6 9 1 10
Hope it helps :-)
