How to write an ifelse statement with multiple conditions in R?

I have a problem writing an ifelse statement. I have two columns, as shown below:
Team 1 Winner
T1 T1
T2 T1
T2 NA
T3 NA
I want another column, Result, such that if Team equals Winner it should be "winner", if Team differs from Winner it should be "loser", and if Winner is NA it should be "noresult":
Team 1 Winner result
T1 T1 winner
T2 T1 loser
T2 NA noresult
T3 NA noresult
Any help would be appreciated.

Another possibility is case_when from dplyr. The conditions are evaluated in order; when Winner is NA, both Team == Winner and Team != Winner evaluate to NA (not TRUE), so those rows fall through to the is.na(Winner) branch:
library(dplyr)
df %>%
  mutate(Result = case_when(
    Team == Winner ~ "Winner",
    Team != Winner ~ "Loser",
    is.na(Winner) ~ "No result"
  ))
# Team Winner Result
# 1 T1 T1 Winner
# 2 T2 T1 Loser
# 3 T2 <NA> No result
# 4 T3 <NA> No result
Data:
tt <- "Team Winner
T1 T1
T2 T1
T2 NA
T3 NA"
df <- read.table(text = tt, header = TRUE, stringsAsFactors = FALSE)

You can use dplyr::if_else(). As I learned, it is strict: it checks that both branches have the same data type, and it handles NAs via the missing argument, which keeps the code simple:
df %>% mutate(Result = if_else(Team == Winner, "Winner", "Loser", missing = "No result"))
Team Winner Result
1 T1 T1 Winner
2 T2 T1 Loser
3 T2 <NA> No result
4 T3 <NA> No result
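A small illustration of that strictness (the exact error wording depends on your dplyr version):
# base ifelse() silently coerces mixed output types:
# the numeric 1 is turned into the string "1"
ifelse(c(TRUE, FALSE), 1, "no")
# [1] "1"  "no"
# if_else() refuses to mix types and errors instead,
# which catches this kind of bug early:
dplyr::if_else(c(TRUE, FALSE), 1, "no")
# Error: `true` and `false` must be the same type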
That said, despite being a one-liner, it's not the fastest for your example data (the winner is @Tim Biegeleisen's ifelse answer, +1):
Unit: microseconds
expr min lq mean median uq max neval cld
IF_ELSE 893.013 974.5060 1176.35331 1053.2260 1343.3590 2278.398 100 b
IFELSE 20.481 34.3475 49.57934 47.3605 58.0275 143.361 100 a
CASE 1067.946 1152.4255 1423.41426 1226.0255 1721.3850 4108.795 100 c
So there is a trade-off between simplicity (subjective, of course), control (objective, given the strictness of the functions), and speed (objective too, though only an issue if your real data is large).
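The timings above can be reproduced along these lines (a sketch; I am assuming the three expressions match the labels IF_ELSE, IFELSE and CASE in the table, and absolute numbers will differ by machine):
library(dplyr)
library(microbenchmark)
microbenchmark(
  IF_ELSE = df %>% mutate(Result = if_else(Team == Winner, "Winner", "Loser",
                                           missing = "No result")),
  IFELSE  = ifelse(is.na(df$Winner), "No result",
                   ifelse(df$Team == df$Winner, "Winner", "Loser")),
  CASE    = df %>% mutate(Result = case_when(Team == Winner ~ "Winner",
                                             Team != Winner ~ "Loser",
                                             is.na(Winner)  ~ "No result"))
)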

Use -
df$Winner <- factor(df[,2], levels=unique(df$Team.1)) # avoid "level sets of factors are different" error
df$result <- ifelse(df$Team.1 == df$Winner, "winner", "loser")
df[is.na(df$result), "result"] <- "noresult"
df
Output
Team.1 Winner result
1 T1 T1 winner
2 T2 T1 loser
3 T2 <NA> noresult
4 T3 <NA> noresult

Try this logic:
df$result <- ifelse(is.na(df$Winner), "no result",
                    ifelse(df$Team == df$Winner, "winner", "loser"))
df
Team Winner result
1 T1 T1 winner
2 T2 T1 loser
3 T2 <NA> no result
4 T3 <NA> no result

Related

Split a list column into multiple columns

I have a data frame where the last column is a column of lists. Below is how it looks:
Col1 | Col2 | ListCol
--------------------------
na | na | [obj1, obj2]
na | na | [obj1, obj2]
na | na | [obj1, obj2]
What I want is
Col1 | Col2 | Col3 | Col4
--------------------------
na | na | obj1 | obj2
na | na | obj1 | obj2
na | na | obj1 | obj2
I know that all the lists have the same number of elements.
Edit:
Every element in ListCol is a list with two elements.
Currently, the tidyverse answer would be:
library(dplyr)
library(tidyr)
data %>% unnest_wider(ListCol)
Here is one approach, using unnest and tidyr::spread...
library(dplyr)
library(tidyr)
#example df
df <- tibble(a=c(1, 2, 3), b=list(c(2, 3), c(4, 5), c(6, 7)))
df %>%
  unnest(b) %>%
  group_by(a) %>%
  mutate(col = seq_along(a)) %>% # add a column indicator
  spread(key = col, value = b)
a `1` `2`
<dbl> <dbl> <dbl>
1 1. 2. 3.
2 2. 4. 5.
3 3. 6. 7.
Comparison of two great answers
There are two great one-liner suggestions in this thread:
(1) cbind(df[1], t(data.frame(df$b)))
This is from @Onyambu using base R. To get to this answer one needs to know that a data frame is a list of columns, plus a bit of creativity.
(2) df %>% unnest_wider(b)
This is from @iago using the tidyverse. You need extra packages and to know all the nest verbs, but arguably it is more readable.
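To see why the base R one-liner works, here is a small illustration using the example df from above (an added sketch, not part of either original answer):
# a data frame is a list of columns, so data.frame(df$b) turns
# each list element into a column, and t() flips those columns
# into rows that cbind() can attach back to df
df <- tibble::tibble(a = c(1, 2, 3), b = list(c(2, 3), c(4, 5), c(6, 7)))
data.frame(df$b)    # 2 rows x 3 columns, one column per list element
t(data.frame(df$b)) # transposed: 3 rows x 2 columns, ready for cbind()
cbind(df[1], t(data.frame(df$b)))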
Now let's compare performance
library(dplyr)
library(tidyr)
library(purrr)
library(microbenchmark)
N <- 100
df <- tibble(a = 1:N, b = map2(1:N, 1:N, c))
tidy_foo <- function() suppressMessages(df %>% unnest_wider(b))
base_foo <- function() cbind(df[1],t(data.frame(df$b))) %>% as_tibble # To be fair
microbenchmark(tidy_foo(), base_foo())
Unit: milliseconds
expr min lq mean median uq max neval
tidy_foo() 102.4388 108.27655 111.99571 109.39410 113.1377 194.2122 100
base_foo() 4.5048 4.71365 5.41841 4.92275 5.2519 13.1042 100
Ouch! The base R solution is about 20 times faster.
Here's an option with data.table and base::unlist.
library(data.table)
DT <- data.table(a = list(1, 2, 3),
                 b = list(list(1, 2),
                          list(2, 1),
                          list(1, 1)))
for (i in 1:nrow(DT)) {
  set(DT,
      i = i,
      j = c('b1', 'b2'),
      value = unlist(DT[i][['b']], recursive = FALSE))
}
DT
This requires a for loop on every row... Not ideal and very anti-data.table.
I wonder if there's some way to avoid creating the list column in the first place...
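One way to avoid the row-wise loop (an added sketch, not from the original answer): flatten each list element to a vector, then data.table::transpose() regroups the first elements and the second elements into one vector per new column.
library(data.table)
DT <- data.table(a = list(1, 2, 3),
                 b = list(list(1, 2), list(2, 1), list(1, 1)))
# lapply(b, unlist) gives list(c(1, 2), c(2, 1), c(1, 1));
# transpose() turns that into list(c(1, 2, 1), c(2, 1, 1))
DT[, c("b1", "b2") := transpose(lapply(b, unlist))]
DT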
@Alec data.table offers the tstrsplit function to split a column into multiple columns.
DT = data.table(x=c("A/B", "A", "B"), y=1:3)
DT[]
# x y
#1: A/B 1
#2: A 2
#3: B 3
DT[, c("c1") := tstrsplit(x, "/", fixed=TRUE, keep=1L)][] # keep only first
# x y c1
#1: A/B 1 A
#2: A 2 A
#3: B 3 B
DT[, c("c1", "c2") := tstrsplit(x, "/", fixed=TRUE)][]
# x y c1 c2
#1: A/B 1 A B
#2: A 2 A <NA>
#3: B 3 B <NA>

Convert to wide format and set 0 if value does not exist

I have following dataset:
dataset1 <- data.frame(
  bnames = c("T1", "T1", "T2", "T3", "T3"),
  events = c("I", "O", "I", "I", "O"),
  freq = c(1, 2, 3, 4, 5))
I want to convert this dataset to wide format, my approach (using reshape package):
dataset2 <- melt(dataset1, id.vars = c("bnames", "events"))
dataset2 <- dataset2[c("bnames", "events", "value")]
names(dataset2) <- c("bnames", "events", "freq")
content of dataset2:
bnames events freq
1 T1 I 1
2 T1 O 2
3 T2 I 3
4 T3 I 4
5 T3 O 5
But there should always be two rows with the same name under the bnames column: one row with I and another with O under the events column. If the corresponding value does not exist in the original dataset (dataset1), then the value under freq should be 0. So my desired result in this case should be:
bnames events freq
1 T1 I 1
2 T1 O 2
3 T2 I 3
4 T2 O 0
5 T3 I 4
6 T3 O 5
How to do this? Thanks
Here's one way in base R:
left_hand <- expand.grid(
  bnames = unique(dataset1$bnames),
  events = c("I", "O"),
  stringsAsFactors = FALSE
)
dataset2 <- merge(left_hand, dataset2, all.x = TRUE)
dataset2[is.na(dataset2)] <- 0
Alternatively, there is a one-liner in tidyr package:
tidyr::complete(dataset2, bnames, events, fill = list(freq = 0))
Here is a data.table solution. Generate all possible combinations of bnames and events with CJ (cross join), left join this set with the original dataset, and return the frequency if available, else set it to 0.
library(data.table)
setDT(dataset1)[CJ(bnames = bnames, events = events, unique = TRUE),
                .(freq = ifelse(is.na(freq), 0, freq)),
                by = .EACHI,
                on = .(bnames, events)]
# bnames events freq
#1: T1 I 1
#2: T1 O 2
#3: T2 I 3
#4: T2 O 0
#5: T3 I 4
#6: T3 O 5
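Another data.table route would be a dcast/melt round trip (an added sketch, not from the original answers): casting to wide with fill = 0 creates every bnames/events pair, and melting back restores the long format.
library(data.table)
wide <- dcast(setDT(dataset1), bnames ~ events, value.var = "freq", fill = 0)
melt(wide, id.vars = "bnames", variable.name = "events", value.name = "freq")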

Merging 3 dataframes with a left join

I have 3 dataframes with unequal numbers of rows:
df1
T1 T2 T3
1 Joe TTT
2 PP YYY
3 JJ QQQ
5 UU OOO
6 OO GGG
df2
X1 X2
1 09/20/2017
2 08/02/2015
3 05/02/2000
8 06/03/1999
df3
L1 L2
1 New
6 Notsure
9 Also
The final dataframe should be a left join of all 3, retaining only the rows of df1. The key columns are T1, X1 and L1, but they have different header names, and the number of rows differs in each dataframe. I couldn't find a solution for this situation: what I found on SO covered either 2 dataframes, or 3 dataframes with equal rows or the same column name.
T1 T2 T3 X2 L2
1 Joe TTT 09/20/2017 New
2 PP YYY 08/02/2015 NA
3 JJ QQQ 05/02/2000 NA
5 UU OOO NA NA
6 OO GGG NA Notsure
I am comparatively new to R and couldn't find R code for this.
The idea is to put your data frames in a list, change the name of the first column, and use Reduce to merge, i.e.
Reduce(function(...) merge(..., by = 'Var1', all.x = TRUE),
       lapply(mget(ls(pattern = 'df[0-9]+')), function(i) {names(i)[1] <- 'Var1'; i}))
which gives,
Var1 T2 T3 X2 L2
1 1 Joe TTT 09/20/2017 New
2 2 PP YYY 08/02/2015 <NA>
3 3 JJ QQQ 05/02/2000 <NA>
4 5 UU OOO <NA> <NA>
5 6 OO GGG <NA> Notsure
Using tidyverse functions, you can try:
df1 %>%
  left_join(df2, by = c("T1" = "X1")) %>%
  left_join(df3, by = c("T1" = "L1"))
which gives:
T1 T2 T3 X2 L2
1 1 Joe TTT 09/20/2017 New
2 2 PP YYY 08/02/2015 <NA>
3 3 JJ QQQ 05/02/2000 <NA>
4 5 UU OOO <NA> <NA>
5 6 OO GGG <NA> Notsure
1) sqldf
library(sqldf)
sqldf("select df1.*, X2, L2
from df1
left join df2 on T1 = X1
left join df3 on T1 = L1")
1a) Although slightly longer, this variation can make the code easier to review later by making it explicit which source each column came from. If the data frame names were long you might want to use aliases (e.g. from df1 as a), but here we don't bother since they are short.
sqldf("select df1.*, df2.X2, df3.L2
from df1
left join df2 on df1.T1 = df2.X1
left join df3 on df1.T1 = df3.L1")
2) merge Using repeated merge. No packages used.
Merge <- function(x, y) merge(x, y, by = 1, all.x = TRUE)
Merge(Merge(df1, df2), df3)
2a) This could also be written using a magrittr pipeline like this:
library(magrittr)
df1 %>% Merge(df2) %>% Merge(df3)
2b) Using Reduce we can do the repeated merges like this:
Reduce(Merge, list(df1, df2, df3))
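2c) Equivalently with purrr, if you prefer a tidyverse-style reduction (an added sketch, not part of the original answer):
library(purrr)
reduce(list(df1, df2, df3), Merge)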
Note: The inputs in reproducible form are:
Lines1 <- "
T1 T2 T3
1 Joe TTT
2 PP YYY
3 JJ QQQ
5 UU OOO
6 OO GGG"
Lines2 <- "
X1 X2
1 09/20/2017
2 08/02/2015
3 05/02/2000
8 06/03/1999"
Lines3 <- "
L1 L2
1 New
6 Notsure
9 Also"
df1 <- read.table(text = Lines1, header = TRUE)
df2 <- read.table(text = Lines2, header = TRUE)
df3 <- read.table(text = Lines3, header = TRUE)
With left_join() it would be something like this:
df1 = data.frame(X = c("a", "b", "c"), var1 = c(1, 2, 3))
df2 = data.frame(V = c("a", "b", "c"), var2 = c(5, NA, NA))
df3 = data.frame(Y = c("a", "b", "c"), var3 = c("name", NA, "age"))
# rename
df2 = df2 %>% rename(X = V)
df3 = df3 %>% rename(X = Y)
df = left_join(df1, df2, by = "X") %>%
  left_join(., df3, by = "X")
> df
X var1 var2 var3
1 a 1 5 name
2 b 2 NA <NA>
3 c 3 NA age

Selecting only those levels of a factor which appear in each level of other factor

I want to select only those levels of Trt which appear in every level of Loc (i.e. the Trt levels common to all levels of Loc, for an arbitrarily large data set).
Loc <- rep(paste0("L", 1:2), c(6, 4))
Trt <- c(rep(paste0("T", 1:3), times = 2), rep(paste0("T", 1:2), times = 2))
set.seed(12345)
Y <- c(rnorm(n=5, mean = 50, sd = 5), NA, rnorm(n=4, mean = 50, sd = 5))
df1 <- data.frame(Loc, Trt, Y)
df1
Loc Trt Y
1 L1 T1 52.92764
2 L1 T2 53.54733
3 L1 T3 49.45348
4 L1 T1 47.73251
5 L1 T2 53.02944
6 L1 T3 NA
7 L2 T1 40.91022
8 L2 T2 53.15049
9 L2 T1 48.61908
10 L2 T2 48.57920
Required Output
Loc Trt Y
L1 T1 52.92764
L1 T2 53.54733
L1 T1 47.73251
L1 T2 53.02944
L2 T1 40.91022
L2 T2 53.15049
L2 T1 48.61908
L2 T2 48.57920
This can be achieved using
library(dplyr)
df1 %>% filter(Trt != "T3")
But here I know the pattern of appearance in advance; I am looking for a more general solution.
Here is another idea with base R. We split Trt based on Loc and use Reduce with intersect to find all common elements. We use those elements to index the original data frame, i.e.
i1 <- Reduce(intersect, split(df1$Trt, df1$Loc))
df1[df1$Trt %in% i1,]
which gives,
Loc Trt Y
1 L1 T1 52.92764
2 L1 T2 53.54733
4 L1 T1 47.73251
5 L1 T2 53.02944
7 L2 T1 40.91022
8 L2 T2 53.15049
9 L2 T1 48.61908
10 L2 T2 48.57920
You are essentially trying to figure out which df1$Trt values exist in every level of df1$Loc. There are probably some nice ways to do it in dplyr that I'm not aware of (see the sketch at the end of this answer). In base R you could do:
dirty <- lapply(levels(df1$Loc), function(x) df1$Trt[df1$Loc == x])
clean <- Reduce(intersect, dirty) # intersect() takes two arguments, so Reduce() generalises to any number of levels
df1[df1$Trt %in% clean, ]
# Loc Trt Y
# 1 L1 T1 52.92764
# 2 L1 T2 53.54733
# 4 L1 T1 47.73251
# 5 L1 T2 53.02944
# 7 L2 T1 40.91022
# 8 L2 T2 53.15049
# 9 L2 T1 48.61908
# 10 L2 T2 48.57920
In the last step you could also stick to your dplyr solution:
df1 %>% filter(Trt %in% clean)
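A pure dplyr version could look like this (an added sketch, not part of the original answer): keep the Trt groups that appear in as many distinct values of Loc as exist overall.
library(dplyr)
df1 %>%
  group_by(Trt) %>%
  filter(n_distinct(Loc) == n_distinct(df1$Loc)) %>%
  ungroup()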
Using data.table, a possible solution is
library(data.table)
setDT(df1)[df1[, uniqueN(Loc), by = Trt][V1 == df1[, uniqueN(Loc)]], on = "Trt"][, -"V1"]
Loc Trt Y
1: L1 T1 52.92764
2: L1 T1 47.73251
3: L2 T1 40.91022
4: L2 T1 48.61908
5: L1 T2 53.54733
6: L1 T2 53.02944
7: L2 T2 53.15049
8: L2 T2 48.57920
Explanation
The total number of unique levels of Loc is
df1[, uniqueN(Loc)]
[1] 2
The number of unique levels of Loc in each Trt is
df1[, uniqueN(Loc), by = Trt]
Trt V1
1: T1 2
2: T2 2
3: T3 1
The levels of Trt which contain all levels of Loc are
df1[, uniqueN(Loc), by = Trt][V1 == df1[, uniqueN(Loc)]]
Trt V1
1: T1 2
2: T2 2
Now, this is right-joined with df1 and the helper column V1 is removed from the result:
df1[df1[, uniqueN(Loc), by = Trt][V1 == df1[, uniqueN(Loc)]], on = "Trt"][, -"V1"]

Compute the number of distinct values in col2 for each distinct value in col1 in R

I have a dataframe like this:
df <- data.frame(
  SchoolID = c("A", "A", "B", "B", "C", "D"),
  Country = c("XX", "XX", "XX", "YY", "ZZ", "ZZ"))
which gives me this data:
SchoolID Country
1 A XX
2 A XX
3 B XX
4 B YY
5 C ZZ
6 D ZZ
I would like to know, for each SchoolID, whether Country is uniquely assigned, by counting the number of distinct values of Country for each distinct value of SchoolID. So I would like to obtain this kind of table:
SchoolID NumberOfCountry
A 1
B 2
C 1
D 1
aggregate(Country ~ SchoolID, df, function(x) length(unique(x)))
Or
tapply(df$Country, df$SchoolID, function(x) length(unique(x)))
Or
library(data.table)
setDT(df)[, .(NumberOfCountry = length(unique(Country))), by = SchoolID]
Or, with data.table v1.9.5+:
setDT(df)[, .(NumberOfCountry = uniqueN(Country)), by = SchoolID]
Or
library(dplyr)
df %>%
  group_by(SchoolID) %>%
  summarise(NumberOfCountry = n_distinct(Country))
One approach, which does not rely on third-party libraries:
> as.data.frame(rowSums(table(df[!duplicated(df), ]), na.rm=T))
rowSums(table(df[!duplicated(df), ]), na.rm = T)
A 1
B 2
C 1
D 1
Or try this in SQL (see the sqldf sketch after the query for running it from R):
select SchoolID, count(Country)
from (select distinct SchoolID, Country
      from tbl_stacko) temp
group by SchoolID
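To run that query from R (an added sketch using the sqldf package; I have substituted the data frame name df for tbl_stacko from the query above):
library(sqldf)
sqldf("select SchoolID, count(Country) as NumberOfCountry
       from (select distinct SchoolID, Country from df) temp
       group by SchoolID")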
