I'd like to create an efficient ifelse statement such that if columns from df2 match columns from df1, then that row in df2 is coded a specific way. My code works but is very inefficient.
Example data:
Df1
A B C
111 2 1
111 5 2
111 7 3
112 2 4
112 8 5
113 2 6
Df2
A B
112 2
111 2
113 2
111 5
111 7
112 8
Desired Outcome:
Df2
A B C
112 2 4
111 2 1
113 2 6
111 5 2
111 7 3
112 8 5
What I've done is this:
Df2$C<- ifelse(Df2$A == 111 & Df2$B == 2, 1, 0)
Df2$C<- ifelse(Df2$A == 111 & Df2$B == 5, 2, 0)
Df2$C<- ifelse(Df2$A == 111 & Df2$B == 7, 3, 0)...
This works, but is there a way such that df2 could reference the columns in df1 and create column df2$C, so that each combination doesn't have to be manually typed out?
This would typically be done with a join. left_join from dplyr connects each row in your first table with each of the matching rows in the second table.
https://dplyr.tidyverse.org/reference/join.html
library(dplyr)
Df2 %>% left_join(Df1)
Joining, by = c("A", "B")
A B C
1 112 2 4
2 111 2 1
3 113 2 6
4 111 5 2
5 111 7 3
6 112 8 5
merge from base R will give a similar result, but doesn't keep the original row order without some extra wrangling.
Merge two data frames while keeping the original row order
merge(Df2, Df1)
A B C
1 111 2 1
2 111 5 2
3 111 7 3
4 112 2 4
5 112 8 5
6 113 2 6
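If neither a join nor the reordering from merge is wanted, one base R alternative is a lookup with match() on a combined key. This is only a minimal sketch: it assumes the A/B pairs in Df1 are unique and that pasting A and B with a separator gives an unambiguous key. The helper name key is illustrative and not part of any package.

```r
# Example data from the question
Df1 <- data.frame(A = c(111, 111, 111, 112, 112, 113),
                  B = c(2, 5, 7, 2, 8, 2),
                  C = c(1, 2, 3, 4, 5, 6))
Df2 <- data.frame(A = c(112, 111, 113, 111, 111, 112),
                  B = c(2, 2, 2, 5, 7, 8))

# Hypothetical helper: build one lookup key per row from columns A and B
key <- function(d) paste(d$A, d$B, sep = "_")

# For each row of Df2, find the matching row of Df1 and copy its C value;
# rows without a match would get NA
Df2$C <- Df1$C[match(key(Df2), key(Df1))]
Df2
```

Because match() returns positions in the order of Df2, the original row order of Df2 is kept without extra wrangling.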
I looked here and elsewhere, but I cannot find something that does exactly what I'm looking to accomplish using R.
I have data similar to the below, where col1 is a unique ID, col2 is a group ID variable, and col3 is a status code. I need to flag every row in a group with 1 if any row in that group has a specific status code (X in this case), otherwise 0.
ID GroupID Status Flag
1 100 A 1
2 100 X 1
3 102 A 0
4 102 B 0
5 103 B 1
6 103 X 1
7 104 X 1
8 104 X 1
9 105 A 0
10 105 C 0
I have tried writing some ifelse where groupID == groupID and status == X then 1 else 0, but that doesn't work. The pattern of Status is random. In this example, the GroupID is exclusively pairs, but I don't want to assume that in the code, because I have other instances where there are 3 or more rows in a GroupID.
It would be helpful if this were open-ended, i.e. I could add other conditions if necessary, like, for each matching group ID, where Status == X, or another value, and so on.
Thank you!
Group-based operations like this are easy to do with the dplyr package.
The data:
library(dplyr)
txt <- 'ID GroupID Status
1 100 A
2 100 X
3 102 A
4 102 B
5 103 B
6 103 X
7 104 X
8 104 X
9 105 A
10 105 C '
df <- read.table(text = txt, header = T)
Once we have the data frame, we establish dplyr groups with the group_by function. The mutate command is then applied within each group, creating a new column entry for each row.
df.new <- df %>%
group_by(GroupID) %>%
mutate(Flag = as.numeric(any(Status == 'X')))
# A tibble: 10 x 4
# Groups: GroupID [5]
ID GroupID Status Flag
<int> <int> <fct> <dbl>
1 1 100 A 1
2 2 100 X 1
3 3 102 A 0
4 4 102 B 0
5 5 103 B 1
6 6 103 X 1
7 7 104 X 1
8 8 104 X 1
9 9 105 A 0
10 10 105 C 0
From base R
ave(df$Status=='X',df$GroupID,FUN=any)
[1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE
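The question also asks for something open-ended, where further status codes can be added later. One way to sketch that with the same ave() idea is to test membership in a vector of codes with %in% instead of a single equality. The name flag_codes is an assumption made for illustration; extend it as needed.

```r
df <- read.table(text = "ID GroupID Status
1 100 A
2 100 X
3 102 A
4 102 B
5 103 B
6 103 X
7 104 X
8 104 X
9 105 A
10 105 C", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical vector of status codes that should trigger the flag;
# add more codes here, e.g. c("X", "Y")
flag_codes <- c("X")

# TRUE for every row of a group if any row in that group has a flagged code,
# then converted to 1/0
df$Flag <- as.integer(ave(df$Status %in% flag_codes, df$GroupID, FUN = any))
df
```

Since the grouping is by GroupID only, this does not assume the groups come in pairs.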
Data.table way:
library(data.table)
setDT(df)
df[ , flag := sum(Status == "X") > 0, by=GroupID]
An alternative using data.table
library(data.table)
dt <- read.table(stringsAsFactors = FALSE,text = "ID GroupID Status
1 100 A
2 100 X
3 102 A
4 102 B
5 103 B
6 103 X
7 104 X
8 104 X
9 105 A
10 105 C", header=T)
setDT(dt)[,.(ID,Status, Flag=ifelse("X"%in% Status,1,0)),by=GroupID]
#returns
GroupID ID Status Flag
1: 100 1 A 1
2: 100 2 X 1
3: 102 3 A 0
4: 102 4 B 0
5: 103 5 B 1
6: 103 6 X 1
7: 104 7 X 1
8: 104 8 X 1
9: 105 9 A 0
10: 105 10 C 0
A base R option with rowsum
i1 <- with(df, rowsum(+(Status == "X"), group = GroupID) > 0)
transform(df, Flag = +(GroupID %in% row.names(i1)[i1]))
Or using table
df$Flag <- +(with(df, GroupID %in% names(which(table(GroupID,
Status == "X")[,2]> 0))))
Example data:
ID <- c('A','A','A','A','A','B','B','B','B','C','C','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','5','6','9')
Intensity <- as.numeric(c('220','192','180','175','140','227','193','163','144','232','205','190','185'))
x <- data.frame(ID, Hour, Intensity)
x
ID Hour Intensity
1 A 0 220
2 A 2 192
3 A 5 180
4 A 6 175
5 A 9 140
6 B 0 227
7 B 2 193
8 B 5 163
9 B 6 144
10 C 0 232
11 C 5 205
12 C 6 190
13 C 9 185
I want to remove all rows associated with an ID where there are non-consecutive values of Hour, according to this list:
uniqueHoursOrder <- sort(unique(Hour))
uniqueHoursOrder
[1] "0" "2" "5" "6" "9"
I want to include any ID so long as it has a row for the first value of uniqueHoursOrder (i.e. 0) and its other rows follow in order according to the order of uniqueHoursOrder. It's OK if an ID doesn't have a row for every value of Hour in uniqueHoursOrder.
For this data, the result should be:
ID Hour Intensity
1 A 0 220
2 A 2 192
3 A 5 180
4 A 6 175
5 A 9 140
6 B 0 227
7 B 2 193
8 B 5 163
9 B 6 144
(ID C is excluded because it's missing Hour 2. B is included because it has consecutive values of Hour starting with 0, even though it doesn't have rows for Hour for all values in uniqueHoursOrder.)
A dplyr solution would be ideal, but I'll take any help I can get.
We can group by 'ID', match 'Hour' against 'uniqueHoursOrder', take the diff of the resulting index, check whether all the differences are equal to 1, and use that logical value to subset the rows.
library(data.table)
setDT(x)[, .SD[all(diff(match(Hour, uniqueHoursOrder))==1)], ID]
# ID Hour Intensity
#1: A 0 220
#2: A 2 192
#3: A 5 180
#4: A 6 175
#5: A 9 140
#6: B 0 227
#7: B 2 193
#8: B 5 163
#9: B 6 144
The same methodology can be used with dplyr
library(dplyr)
x %>%
group_by(ID) %>%
filter(all(diff(match(Hour, uniqueHoursOrder))==1))
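Note that the diff() check only tests that the hours are consecutive. It does not by itself require an ID to start at the first value of uniqueHoursOrder, which the question also asks for. A base R sketch of a stricter check compares the matched positions with 1, 2, 3, ... directly. It assumes the rows are already sorted by Hour within each ID. The extra ID 'D' and the helper name starts_at_first are hypothetical and only added to illustrate the difference.

```r
# Question data plus a hypothetical ID 'D' that starts at Hour 2 instead of 0
ID <- c('A','A','A','A','A','B','B','B','B','C','C','C','C','D','D')
Hour <- c('0','2','5','6','9','0','2','5','6','0','5','6','9','2','5')
Intensity <- c(220,192,180,175,140,227,193,163,144,232,205,190,185,200,190)
x <- data.frame(ID, Hour, Intensity, stringsAsFactors = FALSE)
uniqueHoursOrder <- sort(unique(Hour))

# Position of each Hour in the reference order
pos <- match(x$Hour, uniqueHoursOrder)

# Hypothetical helper: TRUE only if the positions are exactly 1, 2, 3, ...
starts_at_first <- function(p) all(p == seq_along(p))

# Evaluate the check per ID and keep the rows of IDs that pass
keep <- as.logical(ave(pos, x$ID, FUN = starts_at_first))
result <- x[keep, ]
result
```

Here 'C' is dropped because it skips Hour 2, and 'D' is dropped because it does not start at Hour 0, even though its hours are consecutive.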
I have a list, containing several data frames (only 2 in this example) of different sizes.
> myList
$`1`
ID values
1 1 100
2 2 200
3 3 240
4 4 403
5 5 212
6 6 432
7 7 423
8 8 123
9 9 543
10 10 982
$`2`
ID values
1 3 432
2 5 333
3 6 981
Now, I need to omit all rows in any of the data frames whose ID does not appear in all of the other data frames. In this example, the result I'm looking for is:
> myList2
$`1`
ID values
3 3 240
5 5 212
6 6 432
$`2`
ID values
1 3 432
2 5 333
3 6 981
I've tried to use dplyr::setequal() but end up with FALSE: Different number of rows. I'd prefer a base solution if possible. Thanks in advance!
Reproducible code:
myList <- list(data.frame('ID' = c(1:10), 'values' = c(100,200,240,403,212,432,423,123,543,982)),data.frame('ID' = c(3,5,6), 'values' = c(432,333,981)))
One way via base R is to use Reduce(intersect, ...) to find the common IDs from all data frames in the list. We then use that to index the data frames.
ind <- Reduce(intersect, lapply(myList, '[[', 1))
lapply(myList, function(i) i[i$ID %in% ind,])
#[[1]]
# ID values
#3 3 240
#5 5 212
#6 6 432
#[[2]]
# ID values
#1 3 432
#2 5 333
#3 6 981
I am trying to rank multiple numeric variables (700+ variables) in my data and am not sure exactly how to do this, as I am still pretty new to R.
I do not want to overwrite the original values with the ranks, so I need to create a new rank variable for each of these numeric variables.
From reading other posts, I believe the assign and transform functions, along with rank, may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
In addition to the variables xcount, xvisit and ysales, the output dataset needs to be populated with the variables xcount_rank, xvisit_rank and ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as (101, 230], (230, 450] etc., whereas I would like the rank variables to be populated as 1, 2 etc., up to 10 categories, as per the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
Would like to see the records in the group they would fall under if I try to rank the interval values.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep="_")
input[nm1] <- mutate_each(input[2:4],funs(rank(., ties.method="first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
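mutate_each() and funs() have since been deprecated in dplyr, so the calls above may warn or fail on a recent version. A base R sketch of the same idea with lapply(), using the updated data, could look like this. The names vars and rank_names are illustrative.

```r
input <- read.table(header = FALSE, text = "101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id", "xcount", "xvisit", "ysales")

# Columns to bin, and the names of the new columns that receive the bin number
vars <- c("xcount", "xvisit", "ysales")
rank_names <- paste("rank", vars, sep = "_")

# labels = FALSE makes cut() return the integer bin index instead of an
# interval label such as "(1.15,286]"
input[rank_names] <- lapply(input[vars], cut, breaks = 3, labels = FALSE)
input
```

For plain ranks instead of bins, the same pattern should work with lapply(input[vars], rank, ties.method = "first").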
I have a input dataframe like this (the real one is very large, so I want to do it faster):
df1 <- data.frame(A=c(1:5), B=c(5:9), C=c(9:13))
A B C
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
5 5 9 13
I have a dataframe with replacement code like this (the entries here maybe more than df1):
df2 <- data.frame(X=c(1:15), Y=c(101:115))
X Y
1 1 101
2 2 102
3 3 103
4 4 104
5 5 105
6 6 106
7 7 107
8 8 108
9 9 109
10 10 110
11 11 111
12 12 112
13 13 113
14 14 114
15 15 115
By matching df2$X with the values in df1$A and df1$B, I want to get a new_df1 by replacing df1$A and df1$B with the corresponding values in df2$Y, i.e. resulting in this new_df1:
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
Could you give me some guidance on how to do this faster in R, as my dataframe is very large? Many thanks.
As Thilo mentioned, Nico's answer assumes that df2 is ordered by X and that X contains every integer 1,2,3....
I would prefer to use match() as a more general case:
df1 <- data.frame(A=c(1:5), B=c(5:9), C=c(9:13))
df2 <- data.frame(X=c(1:15), Y=c(101:115))
new_df1 <- df1
new_df1$A <- df2$Y[match(df1$A,df2$X)]
new_df1$B <- df2$Y[match(df1$B,df2$X)]
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
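match() returns NA for any value that has no counterpart in df2$X, so unmatched entries would become NA in the result. A minimal sketch that recodes several columns in one pass and keeps the original value when there is no match might look like this. The name recode_cols is an assumption for illustration, and keeping the original value is a design choice rather than something the question specifies.

```r
df1 <- data.frame(A = 1:5, B = 5:9, C = 9:13)
df2 <- data.frame(X = 1:15, Y = 101:115)

# Hypothetical vector naming the columns that should be recoded
recode_cols <- c("A", "B")

new_df1 <- df1
new_df1[recode_cols] <- lapply(df1[recode_cols], function(col) {
  idx <- match(col, df2$X)                  # row of df2 for each value, or NA
  ifelse(is.na(idx), col, df2$Y[idx])       # fall back to the original value
})
new_df1
```

match() is vectorised, so this avoids any row-by-row loop. Whether it is fast enough for a very large data frame would need to be measured.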
It's supereasy! You just need to get the proper offsets in the array.
So for instance, to get the Y column of df2 corresponding to the values in the A column of df1 you'll write df2$Y[df1$A]
Hence, your code will be:
df_new <- data.frame("A" = df2$Y[df1$A], "B" = df2$Y[df1$B], "C" = df1$C)
Here is another (one-liner) way of doing it.
> with(c(df2,df1),data.frame(A = Y[match(A,X)],B = Y[match(B,X)],C))
A B C
1 101 105 9
2 102 106 10
3 103 107 11
4 104 108 12
5 105 109 13
However, I am not sure whether it will be faster than the other suggestions.