I am facing an issue while using the seq() function inside an ifelse() call. I have a data frame that contains the following columns.
Dataframe(df): newmodel id
NewModel_1 30
NewModel_2 30
I need to change the id values for these 2 rows, since the id should not be the same for two models. There is a constant value (99) from which we have to increment the id values based on the condition.
When I try to run the code below
df %>% mutate(id=ifelse(any(grepl("NewModel_", df$newmodel)), seq(from =99+1, by =1, length.out=2) , id))
I get this output:
newmodel id
NewModel_1 100
NewModel_2 100
whereas the expected output is
newmodel id
NewModel_1 100
NewModel_2 101
Can someone explain why this is happening?
Thanks in advance.
Are you looking for something like this?
inds <- grepl('NewModel_', df$newmodel)
df$id[inds] <- seq(100, by = 1, length.out = sum(inds))
df
# newmodel id
#1 NewModel_1 100
#2 NewModel_2 101
data
df <- structure(list(newmodel = c("NewModel_1", "NewModel_2"), id = c(30L,
30L)), class = "data.frame", row.names = c(NA, -2L))
My guess is that ifelse() is only taking the first element of the seq() result.
You can try it this way; it works here.
if(any(grepl("NewModel_", df$newmodel))) {
df$id <- seq(from = 99 + 1, length.out = (length(df$id)))
}
UPDATE: ifelse() returns a result the same length as its test. Here the test is a single value, so only the first element of the sequence is returned, and that single value is then recycled into the whole column. An alternative is to use an apply function.
The reason your ifelse(.) is failing is that ifelse keys its output length based on the input length of the conditional vector; if it is shorter than either of the yes= or no= vectors, the extra length is silently ignored. In your case, any(grepl("NewModel_", df$newmodel)) will never be other than length 1, so the output will be length 1.
For example:
ifelse(TRUE, 1:2, 3:4)
# [1] 1
ifelse(c(TRUE, FALSE), 1:2, 3:4)
# [1] 1 4
### and for an example of how R's overly-permissive recycling can go "wrong"
ifelse(c(TRUE, FALSE, TRUE), 1:2, 3:4)
# [1] 1 4 1
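To make the point concrete, here is a minimal sketch of a vectorised alternative, assuming the goal is to renumber only the matching rows. The condition is built per row, so ifelse() gets a test as long as the data. The third row ("Other", id 7) and the name is_new are made up for illustration.

```r
# Example data; the "Other" row is hypothetical, added to show an untouched row
df <- data.frame(newmodel = c("NewModel_1", "NewModel_2", "Other"),
                 id = c(30L, 30L, 7L),
                 stringsAsFactors = FALSE)

# One TRUE/FALSE per row, so ifelse() returns one value per row
is_new <- grepl("NewModel_", df$newmodel)

# cumsum(is_new) counts the matching rows seen so far (1, 2, 2),
# so matching rows get 100, 101, ... and the rest keep their id
df$id <- ifelse(is_new, 99L + cumsum(is_new), df$id)
df$id
```

The same expression should also work inside mutate(), since every piece of it has one element per row.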
Here's a quick method using match to assign a unique integer to each of the models.
base R
dat$newid <- 99 + match(dat$newmodel, unique(dat$newmodel))
dat
# newmodel id newid
# 1 NewModel_1 30 100
# 2 NewModel_2 30 101
dplyr
library(dplyr)
dat %>%
mutate(newid = 99 + match(newmodel, unique(newmodel)))
# newmodel id newid
# 1 NewModel_1 30 100
# 2 NewModel_2 30 101
Data
dat <- structure(list(newmodel = c("NewModel_1", "NewModel_2"), id = c(30L, 30L), newid = c(100, 101)), row.names = c(NA, -2L), class = "data.frame")
Related
I have two data.frame tables in R. Both have IDs for users who took particular actions. The users in the second table should all have done the actions in the first table, but I want to confirm. What would be the best way to determine whether all the IDs in table 2 are represented in table 1, and if not, which IDs aren't?
Table A
**Unique ID** **Count**
abc123 1
zyx456 15
888aaaa 4
Table B
**Unique ID** **Count**
abc123 1
zyx456 1
zzzzz123 2
I'm trying to get a response that abc123 and zyx456 in Table B are in Table A and that zzzzz123 is not represented in Table A but is in B (which would be an error, since all B should be in A).
This is an efficient one-liner in base R:
setdiff(TableB$ID, TableA$ID)
It will return an empty result if everything in TableB is in TableA, and return the missing IDs if there are any.
Other answers may be better choices with broader context, but this is a simple solution for a simple problem.
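For what it's worth, a small self-contained sketch of that check, assuming the ID column is literally named ID as in the one-liner above (the variable names missing_ids and all_present are just for illustration):

```r
# Tables rebuilt from the question; the column name ID is an assumption
TableA <- data.frame(ID = c("abc123", "zyx456", "888aaaa"),
                     Count = c(1, 15, 4),
                     stringsAsFactors = FALSE)
TableB <- data.frame(ID = c("abc123", "zyx456", "zzzzz123"),
                     Count = c(1, 1, 2),
                     stringsAsFactors = FALSE)

# IDs that appear in B but not in A
missing_ids <- setdiff(TableB$ID, TableA$ID)

# TRUE only when every ID in B is also in A
all_present <- length(missing_ids) == 0

missing_ids
all_present
```

Note that setdiff() is not symmetric: swapping the arguments would instead list the IDs that are in A but not in B.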
We can do this easily with a join in the tidyverse:
library(tidyverse)
JoinedTable = full_join(
  x = TableA %>% mutate(in.A = TRUE),
  y = TableB %>% mutate(in.B = TRUE),
  by = "UniqueID",
  suffix = c(".A",".B")
)
### Use whichever of the following is applicable
## Is in both
JoinedTable %>%
filter(in.A, in.B)
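## In A only (after the join, in.B is NA rather than FALSE for these rows)
JoinedTable %>%
  filter(in.A, is.na(in.B))
## In B only (in.A is NA for these rows)
JoinedTable %>%
  filter(is.na(in.A), in.B)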
Use a full_join to combine the tables, setting "by" to your ID column and adding a suffix to differentiate other columns that appear in both tables. I've added mutates to make the filtering code clearer, but you could also just look for NAs in the respective Count columns (i.e. filter(!is.na(Count.A), is.na(Count.B)) to find ones in A but not B).
If you just want a vector of the ones that meet each condition, just tack on %>% pull(UniqueID) to grab that.
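If you want to see what the joined flags look like without loading the tidyverse, here is a rough base R analogue using merge(..., all = TRUE). This is a swapped-in technique, not the full_join code above, and the names only_in_A, only_in_B and in_both are made up for the sketch. It shows that the flag from the other table comes back as NA for unmatched rows.

```r
TableA <- data.frame(UniqueID = c("abc123", "zyx456", "888aaaa"),
                     Count = c(1, 15, 4),
                     stringsAsFactors = FALSE)
TableB <- data.frame(UniqueID = c("abc123", "zyx456", "zzzzz123"),
                     Count = c(1, 1, 2),
                     stringsAsFactors = FALSE)

# Flag each table before joining, as in the dplyr version
TableA$in.A <- TRUE
TableB$in.B <- TRUE

# all = TRUE keeps rows from both sides (a full outer join)
joined <- merge(TableA, TableB, by = "UniqueID", all = TRUE,
                suffixes = c(".A", ".B"))

# Unmatched rows carry NA in the other table's flag
only_in_A <- joined$UniqueID[is.na(joined$in.B)]
only_in_B <- joined$UniqueID[is.na(joined$in.A)]
in_both   <- joined$UniqueID[!is.na(joined$in.A) & !is.na(joined$in.B)]
```

Here only_in_B holds the IDs that would count as errors in the question's setup.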
You can add another column to table B to show whether each ID is also in table A. Here is code that does it (assuming dfA and dfB denote tables A and B):
dfB <- within(dfB, in_dfA <- UniqueID %in% dfA$UniqueID)
gives
> dfB
UniqueID Count in_dfA
1 abc123 1 TRUE
2 zyx456 1 TRUE
3 zzzzz123 2 FALSE
DATA
dfA <- structure(list(UniqueID = structure(c(2L, 3L, 1L), .Label = c("888aaaa",
"abc123", "zyx456"), class = "factor"), Count = c(1L, 15L, 4L
)), class = "data.frame", row.names = c(NA, -3L))
dfB <- structure(list(UniqueID = structure(1:3, .Label = c("abc123",
"zyx456", "zzzzz123"), class = "factor"), Count = c(1L, 1L, 2L
), in_dfA = c(TRUE, TRUE, FALSE)), row.names = c(NA, -3L), class = "data.frame")
How about using the %in% operator to see which are in both versus those that are not:
library(tibble)
library(tidyverse)
df1 <- tribble(~ID, ~Count,
'abc', 1,
'zyx', 15,
'other', 3)
df2 <- tribble(~ID, ~Count,
'abc', 2,
'zyx', 33,
'another', 334)
match <- df2[which(df2$ID %in% df1$ID),'ID']
notmatch <- df2[which(!(df2$ID %in% df1$ID)),'ID']
This outputs two comparisons that you can use to check for values in a function and pass errors if need be:
match
# A tibble: 2 x 1
ID
<chr>
1 abc
2 zyx
notmatch
# A tibble: 1 x 1
ID
<chr>
1 another
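As a sketch of the "pass errors if need be" idea, the %in% check can be wrapped in a small function that stops when something is missing. The function name check_ids_present and the wording of the message are hypothetical, not from any package.

```r
# Hypothetical helper: errors if any child ID is absent from the parent IDs
check_ids_present <- function(child_ids, parent_ids) {
  missing_ids <- unique(child_ids[!(child_ids %in% parent_ids)])
  if (length(missing_ids) > 0) {
    stop("IDs not found in the first table: ",
         paste(missing_ids, collapse = ", "))
  }
  invisible(TRUE)
}

ids1 <- c("abc", "zyx", "other")
ids2 <- c("abc", "zyx", "another")

# tryCatch() turns the error into its message so the script keeps running
result <- tryCatch(check_ids_present(ids2, ids1),
                   error = function(e) conditionMessage(e))
result
```

With data frames like df1 and df2 above, you would call it as check_ids_present(df2$ID, df1$ID).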
You could do an update join to see which IDs are/aren't in the first table
tblb[tbla, on = 'UniqueID', in_tbla := i.UniqueID
][, in_tbla := !is.na(in_tbla)]
tblb
# UniqueID Count in_tbla
# 1: abc123 1 TRUE
# 2: zyx456 1 TRUE
# 3: zzzzz123 2 FALSE
Not sure if that's any better than @Onyambu's suggestion though (same output)
tblb[, in_tbla := UniqueID %in% tbla$UniqueID]
Data used:
library(data.table)
tbla <- fread('
UniqueID Count
abc123 1
zyx456 15
888aaaa 4
')
tblb <- fread('
UniqueID Count
abc123 1
zyx456 1
zzzzz123 2
')
I have 1 row of data and 50 columns in the row from a csv which I've put into a dataframe. The data is arranged across the spreadsheet like this:
"FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"...
How would I select only the middle part of each element (e.g. "DFGS" in the first one, "SGRE" in the second, etc.), count their occurrences and display the results?
I have tried using the strsplit function but I couldn't get it to work for the entire row of data. I'm thinking a loop of some kind might be what I need.
You can do unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)] (assuming your data is consistently of the form A-B-C).
# E.g.
fun <- function(x) unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)]
fun(c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"))
# [1] "DFGS" "SGRE" "DFGS"
Edit
# Data frame
df <- structure(list(a = "FSEG-DFGS-THDG", b = "SGDG-SGRE-JJDF", c = "DIDC-DFGS-LEMS"),
class = "data.frame", row.names = c(NA, -1L))
fun(t(df[1,]))
# [1] "DFGS" "SGRE" "DFGS"
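The question also asks for counts, so here is a minimal sketch that extracts the middle part and tabulates it. It assumes every element has the form A-B-C; the names parts, middle and counts are just for illustration.

```r
x <- c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS")

# Split each string on the literal "-" (fixed = TRUE avoids regex handling)
parts <- strsplit(x, "-", fixed = TRUE)

# Take the second piece of every split; vapply() guarantees a character vector
middle <- vapply(parts, function(p) p[2], character(1))

# Frequency of each middle part
counts <- table(middle)
counts
```

For a one-row data frame, unlist(df[1, ]) can be passed in place of x.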
First we create a function strng() and then we apply() it on every column of df. strsplit() splits a string by "-" and strng() returns the second part.
df = data.frame(a = "ab-bc-ca", b = "gn-bc-ca", c = "kj-ll-mn")
strng = function(x) {
strsplit(x,"-")[[1]][2]
}
# table() outputs frequency of elements in the input
table(apply(df, MARGIN = 2, FUN = strng))
# output:
# bc ll
#  2  1
I have a dataframe where I want to replace the variables
age_1 with values of variable age1_corr_1 if age1_corr_1 is not NA
age_2 with values of variable age1_corr_2 if age1_corr_2 is not NA, ...,
age_n with values of variable age1_corr_n if age1_corr_n is not NA.
Then I'd like to delete the variables age1_corr_1, age1_corr_2, ..., age1_corr_n. I have figured out how to do the first part (change the values) in a loop but couldn't figure out how to delete the variables after. Any suggestion?
Sample data
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
The code that will change values of age_n based on age1_corr_n
for(i in 1:4){
cname1 <- paste0("age_",i)
cname2 <- paste0("age1_corr_",i)
y[,cname1] <- ifelse(!is.na(y[,cname2]), y[,cname2], y[,cname1])
}
The output I'd like to have is
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7
You have several options if there is a pattern to the columns you want to remove (or conversely, the ones you want to keep).
Here's the data you provided:
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
Here's a dplyr example of how to get only those columns that follow the pattern age_N, where N is 1, 2, 3, or 4:
library(dplyr)
x <- select(y, paste("age", 1:4, sep = "_"))
Alternatively, you could choose the pattern for the columns you DON'T want:
x <- select(y, -contains("_corr_"))
This uses the following strategy:
* you can select for everything BUT a column or set of columns by adding a minus sign first.
* contains() is a selection helper in dplyr that matches every variable name containing the given literal string (here, "_corr_"). Older dplyr versions could do the same with grep() over current_vars(), but that helper has since been deprecated.
Do the real work with dplyr::coalesce() (description: "Given a set of vectors, coalesce() finds the first non-missing value at each position."). Then drop the columns with dplyr::select(), using a negative sign in front of the columns you don't need anymore.
library(magrittr)
y %>%
dplyr::mutate(
age1_corr_4 = as.numeric(age1_corr_4), # Delete this line if it's already a numeric/floating data type.
age_1 = dplyr::coalesce(age1_corr_1, age_1),
age_2 = dplyr::coalesce(age1_corr_2, age_2),
age_3 = dplyr::coalesce(age1_corr_3, age_3),
age_4 = dplyr::coalesce(age1_corr_4, age_4)
) %>%
dplyr::select(
-age1_corr_1, -age1_corr_2, -age1_corr_3, -age1_corr_4
)
Produces
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7
Edit: I apologize, I focused on the coalesce part of the task and ignored the n part of the task.
Here are two other approaches that can handle an arbitrary number of columns. For this specific example dataset, make sure that the age1_corr_4 column (which is entirely NA and therefore logical) is represented as a float with y$age1_corr_4 <- as.numeric(y$age1_corr_4).
Like Dan Hall's response, one approach keeps the columns you want...
library(magrittr)
coalesce_corr1 <- function( index ) {
name_age <- paste0("age_" , index)
name_corr <- paste0("age1_corr_", index)
y %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[name_corr]], .data[[name_age]])
) %>%
dplyr::select(!!name_age)
}
1:4 %>%
  purrr::map(coalesce_corr1) %>%
  dplyr::bind_cols()
...and the other drops the columns you don't want.
z <- y
coalesce_corr2 <- function( index ) {
name_age <- paste0( "age_" , index)
name_corr <- paste0( "age1_corr_", index)
z <<- z %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[!!name_corr]], .data[[!!name_age]])
)
z[[name_corr]] <<- NULL
}
1:4 %>%
purrr::walk(coalesce_corr2)
z
I wish this last one didn't require a global variable (that uses <<-), and for this reason, I actually recommend Dan's approaches, but I wanted to try out quosures for output variables.
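If the global variable is the sticking point, a plain base R loop is one way around it. This is only a sketch under the assumption that every correction column is named age1_corr_N and pairs with age_N; it does not use coalesce() or quosures at all.

```r
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0),
                "age_2" = c(1,2,3,4),  "age1_corr_2" = c(NA, NA, 10, 9),
                "age_3" = c(4,3,2,5),  "age1_corr_3" = c(NA,NA,NA,6),
                "age_4" = c(1,4,2,7),  "age1_corr_4" = c(NA, NA, NA,NA))

# Find the correction columns by name, however many there are
corr_cols <- grep("^age1_corr_", names(y), value = TRUE)

for (corr in corr_cols) {
  # age1_corr_3 pairs with age_3, and so on
  target <- sub("^age1_corr_", "age_", corr)
  use_corr <- !is.na(y[[corr]])
  # Overwrite only the rows where a correction exists
  y[[target]][use_corr] <- y[[corr]][use_corr]
}

# Drop the correction columns once they have been applied
y <- y[!names(y) %in% corr_cols]
y
```

Because only non-missing corrections are copied, the all-NA age1_corr_4 column needs no as.numeric() conversion here.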
Given a data set similar to the following
dat = structure(list(OpportunityId = c("006a000000zLXtZAAW", "006a000000zLXtZAAW",
"006a000000gst", "006a000000gstg", "006a000000gstg",
"006a000000zLXtZAAW"), IsWon = c(1, 1, 1, 1, 1, 1),
sequence = c("LLLML", "LHHHL", "LLLML", "HMLLL", "LLLLL", "LLLLL")),
.Names = c("OpportunityId","IsWon", "sequence"), row.names = c(NA, 6L), class = "data.frame")
dat
How would one go about combining all the sequences associated with a particular opportunity id, so that the final result looks like this?
oppid sequence
006... LLL, LML, MMM
007... MMM, MML, MMH, LLL, HHH
007... LML, MMM
Any ideas?
We can paste the 'sequence' after grouping by 'OpportunityId'
library(data.table)
setDT(dat)[, .(sequence = toString(unique(sequence))) ,
by = .(oppid = OpportunityId)]
Maybe a combination of aggregate and unique could help.
aggregate(sequence ~ OpportunityId, dat, unique)
# OpportunityId sequence
#1 006a000000gst LLLML
#2 006a000000gstg HMLLL, LLLLL
#3 006a000000zLXtZAAW LLLML, LHHHL, LLLLL
As pointed out by @akrun in a comment, the sequence column is stored as a list in this case.
If necessary, the list in the sequence column can be converted into character format (a single string for each row) as follows, where dat now refers to the aggregated result rather than the original data:
dat$sequence <- sapply(dat$sequence, paste, collapse=", ")
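The two steps can also be folded into one by having aggregate() collapse each group to a single string directly. A minimal sketch, assuming you want unique sequences joined by ", " (the name res is just for illustration):

```r
dat <- data.frame(
  OpportunityId = c("006a000000zLXtZAAW", "006a000000zLXtZAAW",
                    "006a000000gst", "006a000000gstg", "006a000000gstg",
                    "006a000000zLXtZAAW"),
  IsWon = c(1, 1, 1, 1, 1, 1),
  sequence = c("LLLML", "LHHHL", "LLLML", "HMLLL", "LLLLL", "LLLLL"),
  stringsAsFactors = FALSE
)

# toString() returns one string per group, so the result column is
# an ordinary character vector rather than a list
res <- aggregate(sequence ~ OpportunityId, data = dat,
                 FUN = function(s) toString(unique(s)))
res
```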
With dplyr
library(dplyr)
dat_new <- dat %>%
  group_by(OpportunityId) %>%
  # mutate() (rather than summarise()) keeps the other columns, such as IsWon
  mutate(sequence = toString(sequence)) %>%
  distinct(.keep_all = TRUE)
Output
# OpportunityId IsWon sequence
# 1 006a000000zLXtZAAW 1 LLLML, LHHHL, LLLLL
# 2 006a000000gst 1 LLLML
# 3 006a000000gstg 1 HMLLL, LLLLL
In R I want to create a boxplot over count data instead of raw data. So my table schema looks like
Value | Count
1 | 2
2 | 1
...
Instead of
Value
1
1
2
...
Where in the second case I could simply do boxplot(x)
I'm sure there's a way to do what you want with the already summarized data, but if not, you can abuse the fact that rep takes vectors:
> dat <- data.frame(Value = 1:5, Count = sample.int(5))
> dat
Value Count
1 1 1
2 2 3
3 3 4
4 4 2
5 5 5
> rep(dat$Value, dat$Count)
[1] 1 2 2 2 3 3 3 3 4 4 5 5 5 5 5
Simply wrap boxplot around that and you should get what you want. I'm sure there's a more efficient / better way to do that, but this should work for you.
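Spelled out with fixed counts (so the result is reproducible, unlike the sample.int call above), the idea looks roughly like this. The names expanded and bp are made up for the sketch, and plot = FALSE is only there so it can run without a graphics device.

```r
# Summarised data: each Value occurs Count times
dat <- data.frame(Value = 1:5, Count = c(1, 3, 4, 2, 5))

# Rebuild the raw observations: Value[i] is repeated Count[i] times
expanded <- rep(dat$Value, dat$Count)

# Compute the boxplot statistics; drop plot = FALSE to draw the plot
bp <- boxplot(expanded, plot = FALSE)
bp$stats
```

This expands the data in memory, so for very large counts a weighted-quantile approach may be preferable.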
I solved a similar issue recently by using the 'apply' function on each column of counts with the 'rep' function:
> datablock <- apply(countblock[-1], 2, function(x){rep(countblock$value, x)})
> boxplot(datablock)
...The above assumes that your values are in the first column and subsequent columns contain count data.
A combination of rep and data.frame can be used as an approach if another variable is needed for classification
Eg.
with(data.frame(v1 = rep(data$v1, data$count), v2 = rep(data$v2, data$count)),
     boxplot(v1 ~ v2)
)
Toy data:
(besides Value and Count, I add a categorical variable Group)
set.seed(12345)
df <- data.frame(Value = sample(1:100, 100, replace = TRUE),
                 Count = sample(1:10, 100, replace = TRUE),
                 Group = sample(c("A", "B", "C"), 100, replace = TRUE),
                 stringsAsFactors = FALSE)
Use purrr::pmap and purrr::reduce to manipulate the data frame:
library(purrr)
data <- pmap(df, function(Value, Count, Group){
data.frame(x = rep(Value, Count),
y = rep(Group, Count))
}) %>% reduce(rbind)
boxplot(x ~ y, data = data)