Merging multiple columns in a dataframe based on condition in R - r

I am very new to R, and I want to do the following:
I have a data frame that consists of ID, Col1, Col2, Col3 columns.
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text="
ID Col1 Col2 Col3
1 0 'Less than once a month' 0
2 Never 0 0
3 0 0 'Once a month'
")
I want to merge those 3 columns into one: if one column contains "Never" and the others contain 0, the value should be "Never"; if one contains "Once a month" and the rest are 0, the value should be "Once a month", and so on. The columns are mutually exclusive, meaning "Never" and "Once a month" cannot appear in the same row.
I tried to apply this loop:
for (val in df) {
if(df$Col1 == "Never" && df$Col2 == "0")
{
df$consolidated <- "Never"
} else (df$`Col1 == "0" && df$Col2 == "Less than once a month")
{
how_oft_purch_gr_pers$consolidated <- "Less than once a month"
}
}
I wanted to figure it out for two columns first, but it didn't work: all rows in the consolidated column are filled with "Less than once a month".
I want it to be like this:
ID  Col1   Col2                    Col3          Consolidated
1   0      Less than once a month  0             Less than once a month
2   Never  0                       0             Never
3   0      0                       Once a month  Once a month
Any hint on what I am doing wrong?
Thank you in advance

You can use dplyr::coalesce after replacing 0 with NA. coalesce() finds the first non-missing value (across the columns of each row, in this case), and that value fills the new column. One solution:
library(dplyr)
df %>% mutate_at(vars(starts_with("Col")), funs(na_if(.,"0"))) %>%
mutate(Consolidated = coalesce(Col1,Col2,Col3)) %>%
select(ID, Consolidated)
# Or, more concisely, one can simply write
bind_cols(df[1], Consolidated = coalesce(!!!na_if(df[-1],"0")))
# ID Consolidated
# 1 1 Less than once a month
# 2 2 Never
# 3 3 Once a month
Data:
df <- read.table(text =
"ID Col1 Col2 Col3
1 0 'Less than once a month' 0
2 Never 0 0
3 0 0 'Once a month'",
stringsAsFactors = FALSE, header = TRUE)

Even though @MKR has written a good answer, I want to point out a few errors in your code which might be the reason why it does not work:
for (val in df) {
You probably want to loop over all rows of df. However, you are in fact looping over the columns of your data frame. The reason is that a data frame is a list of vectors (your columns) which all must have the same length. With your code you iterate over the elements of df, which are the columns. See Q&A For each row in data.frame
if(df$Col1 == "Never" && df$Col2 == "0"){
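Note that the double && differs from &: & compares vectors element by element, whereas && expects a single TRUE or FALSE on each side. Older versions of R silently used only the first element of each vector; since R 4.3.0, passing a vector of length greater than one to && is an error. See for example Q&A Boolean Operators && and ||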
df$consolidated <- "Never"
Here, you set the whole column consolidated of df to "Never", because you do not use the iteration variable from above (and even if you did, it would not stand for one row of df, given how the loop is written).
} else (df$`Col1 == "0" && df$Col2 == "Less than once a month"){
You need to use else if(...), not else (...). Like you wrote it, R will think the statement in (....) should be executed if the if(...) above is not true and the statement in {...} after the if would be regarded by R as having nothing to do with the if... else... construct, because it already executed (...). So it will execute the {...} block always, regardless of what is the outcome of the above if(...).
Is df$`Col1 a typo? The backtick ` should only occur in pairs and can be used around variables (also column names)
df$consolidated <- "Less than once a month"
Here you again set a whole column to one value, like explained above.
}
}
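Putting these points together, here is a minimal loop-free sketch in base R. It assumes the columns are character vectors in which "0" means "no answer", as in the question's sample data, and the column name consolidated is taken from the question's own code.

```r
# Sample data from the question, built directly as character columns
df <- data.frame(
  ID   = 1:3,
  Col1 = c("0", "Never", "0"),
  Col2 = c("Less than once a month", "0", "0"),
  Col3 = c("0", "0", "Once a month"),
  stringsAsFactors = FALSE
)

# & works element by element, so this yields one TRUE/FALSE per row
never_rows <- df$Col1 == "Never" & df$Col2 == "0"

# Nested ifelse() picks, row by row, the first column that is not "0"
df$consolidated <- ifelse(df$Col1 != "0", df$Col1,
                   ifelse(df$Col2 != "0", df$Col2, df$Col3))
```

Because ifelse() is vectorized, no for loop and no row index are needed; each row is handled independently.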

This is a possibility using base R:
Start your result column. Initialize it with only "0".
df$coalesced <- "0"
Loop over some columns of df (Col1 to Col3). Use drop = FALSE in case you ever select only one column: without it, R would return a vector, and for would loop over the elements of that vector instead of over the single column.
for (column in df[, c("Col1","Col2","Col3"), drop = FALSE]) {
This checks each element of coalesced to see whether it is already filled, and if not (that is, if it is still "0"), fills it with the value from the current column (which may also be "0").
df$coalesced <- ifelse(df$coalesced == "0", column, df$coalesced)
}
The loop assigns to df$coalesced directly, so the new column is already part of your data frame:
df

Related

How to replace zeros and ones instead of words into a data frame that has quantitative and qualitative data

I have 6 columns in my data frame, named exam 1, exam 2, exam 3, result exam 1, result exam 2, and result exam 3. The first three columns contain numbers and NAs, and the last three contain Pass, Fail, and NAs. I want to replace every Pass with 1, and every Fail and every NA with 0.
I have used multiple approaches in R but I can't make it work.
df[df == 'NA'] <- 0
df[df == NA] <- 0
df[df$"result exam 1" == "Pass",]$"result exam 1" = 1
df[df$"result exam 1" == "Fail",]$"result exam 1" = 0
None of this code works.
Would someone be able to please help with this problem?
Thank you
You really need to get a better grasp of basic R syntax:
You are putting the subset operator [ in the wrong place (you need to subset your vector, not the data frame)
You are then using the $ operator on the result of the previous operation, and that throws an error (that you should have posted) because $ cannot be used on vectors.
You are testing a value for missingness with ==, but x == NA makes no sense: comparing anything with a non-available value yields NA, never TRUE. You must use the is.na() function.
Here is what you should have done (with just a bit of help from basic R tutorials):
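# Recode one result column (repeat for the others), then fill the NAs
df$`result exam 1`[df$`result exam 1` == "Pass"] <- 1
df$`result exam 1`[df$`result exam 1` == "Fail"] <- 0
df$`result exam 1`[is.na(df$`result exam 1`)] <- 0
df$`exam 1`[is.na(df$`exam 1`)] <- 0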
To do this for multiple columns in one go, you can use the following:
cols <- grep('result', names(df))
df[cols][is.na(df[cols])] <- 0
df[cols][df[cols] == 'Fail'] <- 0
df[cols][df[cols] == 'Pass'] <- 1
df
Assuming the name of the data frame is dt, make a vector of the names of the result columns:
library(dplyr)
result <- c("result exam 1", "result exam 2", "result exam 3")
dt <- dt %>% mutate_at(result, ~ifelse(.x %in% "Pass", 1, 0))
This will replace every "Pass" with 1 and everything else (Fail and NA) with 0. %in% is used instead of == because NA == "Pass" is NA, which ifelse() would pass through as NA rather than 0.
For NAs in the other columns:
dt[c("exam 1", "exam 2", "exam 3")][is.na(dt[c("exam 1", "exam 2", "exam 3")])] <- 0
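For a version that can be run without extra packages, here is a small sketch in base R. The sample values are made up for illustration; only the column names come from the question, and check.names = FALSE is needed to keep the spaces in them.

```r
# Hypothetical sample data with the question's column names
df <- data.frame(
  `exam 1`        = c(80, NA, 55),
  `result exam 1` = c("Pass", NA, "Fail"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Result columns: 1 for "Pass", 0 for everything else (Fail and NA)
result_cols <- grep("^result", names(df), value = TRUE)
df[result_cols] <- lapply(df[result_cols],
                          function(x) as.integer(x %in% "Pass"))

# Remaining NAs (in the numeric exam columns) become 0
df[is.na(df)] <- 0
```

x %in% "Pass" returns FALSE for NA, so Fail and NA are both mapped to 0 in one step.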

R: Update Column Based on Text Condition from Another Column

I would like to make a new column in my data frame by using a conditional statement that would say "If Column_y contains Column_x then 1 else 0"
For example:
Event  Name   Winner     Loser       New Column
1      James  James,Bob  John,Steve  1
1      Bob    James,Bob  John,Steve  1
1      John   James,Bob  John,Steve  0
1      Steve  James,Bob  John,Steve  0
I want to have New Column<- "If Winner contains Name then 1 else 0"
Keep in mind this is for 100,000 rows and probably 700 unique names. When I try things like
df$NewColumn<-ifelse(grepl(df$Name,df$Winner)==TRUE,1,0)
or variations I get the "pattern has a length > 1" error.
I think you just want to compare the Name column against the Winner column:
df$NewColumn <- ifelse(df$Name == df$Winner, 1, 0)
Note that because df$Name == df$Winner is actually a boolean expression, you might also be able to simplify to:
df$NewColumn <- df$Name == df$Winner
In your example, exact string matching works. But I am assuming it does not hold true for your entire data.
Implementing the contains condition would be something like this:
library(dplyr)
library(purrr)
df = df %>%
dplyr::mutate(NewColumn = purrr::map2_dbl(.x=Winner,.y=Name,~ifelse(grepl(.y,.x),1,0)))
Adding an alternate solution with stringr:
library(stringr)
df = df %>%
dplyr::mutate(NewColumn=ifelse(str_detect(Winner,Name),1,0))
Let me know if this works.
P.S.: str_detect is faster.
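As a base R alternative, the sketch below checks each Name against the comma-separated entries of Winner in the same row. It assumes the names in Winner are separated by a plain comma with no spaces, as in the example, and it matches whole names only, so "Bob" would not match a hypothetical "Bobby".

```r
# Example rows from the question
df <- data.frame(
  Name   = c("James", "Bob", "John", "Steve"),
  Winner = rep("James,Bob", 4),
  Loser  = rep("John,Steve", 4),
  stringsAsFactors = FALSE
)

# For each row: split Winner on commas and test whether Name is among the parts
is_winner <- mapply(
  function(name, winners) name %in% strsplit(winners, ",", fixed = TRUE)[[1]],
  df$Name, df$Winner
)
df$NewColumn <- as.integer(is_winner)
```

mapply() walks over both columns in parallel, which avoids the "pattern has a length > 1" problem that arises when a whole column is passed as the pattern to grepl().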

How to drop a buffer of rows in a data frame around rows of a certain condition

I am trying to remove rows in a data frame that are within x rows after rows meeting a certain condition.
I have a data frame with a response variable, a measurement type that represents the condition, and time. Here's a mock data set:
data <- data.frame(rlnorm(45,0,1),
c(rep(1,15),rep(2,15),rep(1,15)),
seq(
from=as.POSIXct("2012-1-1 0:00", tz="EST"),
to=as.POSIXct("2012-1-1 0:44", tz="EST"),
by="min"))
names(data) <- c('Variable','Type','Time')
In this mock case, I want to delete the first 5 rows in condition 1 after condition 2 occurs.
The way I thought about solving this problem was to generate a separate vector that determines the distance that each observation that is a 1 is from the last 2. Here's the code I wrote:
dist = vector()
for(i in 1:nrow(data)) {
if(data$Type[i] != 1) dist[i] <- 0
else {
position = i
tempcount = 0
while(position > 0 && data$Type[position] == 1){
position = position - 1
tempcount = tempcount + 1
}
dist[i] = tempcount
}
}
This code will do the trick, but it's extremely inefficient. I was wondering if anyone had some cleverer, faster solutions.
If I understand you correctly, this should do the trick:
criteria1 = which(data$Type[2:nrow(data)] == 2 & data$Type[2:nrow(data)] != data$Type[1:(nrow(data)-1)]) + 1
criteria2 = as.vector(sapply(criteria1,function(x) seq(x,x+5)))
data[-criteria2,]
How it works:
criteria1 contains the indices where Type==2 but the previous row has a different type. The strange-looking subsets like 2:nrow(data) are there because we want to compare each row to the previous one, and the first row has no previous row. Therefore we add +1 at the end.
criteria2 contains sequences running from each number in criteria1 to that number+5.
The third line performs the subset.
This might need small modification, I wasn't exactly clear what criteria 1 and criteria 2 were from your code. Let me know if this works or you need any more advice!
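Another way to read the question is: drop the first 5 rows of every Type 1 block that follows another block. Under that assumption, a loop-free sketch in base R could look as follows. The names run_id, pos_in_run and buffer are made up for illustration, and the mock data keeps only the Type pattern from the question.

```r
# Mock data with the same Type pattern as in the question
data <- data.frame(Variable = seq_len(45),
                   Type = c(rep(1, 15), rep(2, 15), rep(1, 15)))

# Label each run of identical Type values: 1, 1, ..., 2, 2, ..., 3, 3, ...
run_id <- cumsum(c(TRUE, diff(data$Type) != 0))

# Position of every row inside its own run (1, 2, 3, ...)
pos_in_run <- ave(seq_along(run_id), run_id, FUN = seq_along)

# Same values as the dist vector built by the loop in the question
dist <- ifelse(data$Type == 1, pos_in_run, 0)

# Drop the first `buffer` rows of every Type 1 run that follows another run
buffer <- 5
drop <- data$Type == 1 & run_id > 1 & pos_in_run <= buffer
trimmed <- data[!drop, ]
```

Every step is vectorized, so the nested for/while loop is not needed; on the mock data this removes rows 31 to 35.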

Dynamic merge in R

I have an example filter table, shown below, and a big source data table. I need to merge the two. If no column in a row of the filter table contains ALL, all three columns are used for the merge (Tran=1001, Acct=1 and Co=a for the inner join with the data table). If one of them, e.g. Tran, is ALL, the remaining two columns are used (Acct=3 and Co=c). If two of them, e.g. Tran and Acct, are ALL, the remaining column is used (Co=b).
The real difficulty is that the number of columns is not known in advance.
Can anyone help me with this?
Tran Acct Co
1001 1 a
1002 ALL ALL
ALL ALL b
ALL 4 ALL
1003 2 ALL
ALL 3 c
1004 ALL d
You're going to have to write a series of conditional statements using if, else if and else. I'll use the %in% operator to check for this. The %in% operator returns a vector of boolean values. The easiest way to show this is through an example:
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 3, 4, 5, 6)
> x %in% y
[1] FALSE TRUE TRUE TRUE TRUE
Notice that it returns FALSE for the first value as the value of 1 in x is not in y. You can do the same for the "ALL" value in your data set. I assume you are going row by row as you seemed to imply in your question. Let me know if you need to check the whole column first (you can use the any function for that case). Here is an example of your first condition:
# Assume that df is your data.frame of data.
for (i in 1:length(df$Tran)) {
if (!("ALL" %in% df$Tran[i]) & !("ALL" %in% df$Acct[i]) & !("ALL" %in% df$Co[i])) {
# Do your merge here
}
if ( [Put your next condition here] ) {
# Do the appropriate merge for that condition
}
...
Note that I used the "!" operator to get the inverse of whatever %in% returns, because you want the case where ALL is NOT in the row. I realize now that you could have just written "ALL" != df$Tran[i] since you are going row by row, but %in% might be more useful if you end up checking the whole column.
Hope this helps!
Editing in a new method now that it's more clear what the need is. So we have to find the number of "ALL" values in each row and then merge a certain way depending on the number of them. There are a lot of methods, but here's one I like:
> test <- data.frame(a = "ALL", b = 2, c = "ALL", d = 3, e = "ALL")
> test
a b c d e
1 ALL 2 ALL 3 ALL
> table(test[1, ] == "ALL")["TRUE"]
TRUE
3
Basically, I'm looking at the first row and counting the values that return TRUE when compared with the string "ALL". From here you can set conditionals on this number. To automate this over the entire data frame, put it in a for loop and replace the 1 with i or whatever your loop variable is.
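If only the count per row is needed, the loop can probably be avoided. The sketch below applies rowSums() to the filter table from the question; the name filter_tbl is made up here, and all columns are assumed to be character.

```r
# Filter table from the question, stored as character columns
filter_tbl <- data.frame(
  Tran = c("1001", "1002", "ALL", "ALL", "1003", "ALL", "1004"),
  Acct = c("1", "ALL", "ALL", "4", "2", "3", "ALL"),
  Co   = c("a", "ALL", "b", "ALL", "ALL", "c", "d"),
  stringsAsFactors = FALSE
)

# filter_tbl == "ALL" is a logical matrix; rowSums() counts the TRUEs per row
all_count <- rowSums(filter_tbl == "ALL")
```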
To get which rows have "ALL" in it (which in converse would tell which rows do not have "ALL" in it as well), you can use grep on each row. Here's a short example:
> # Initializing a sample data frame.
> df <- data.frame(a = "1", b = "ALL", c = "ALL", d = "5", e = "ALL")
> print(df)
a b c d e
1 1 ALL ALL 5 ALL
>
> # Finding the column numbers that have "ALL" in it using grep.
> places <- grep("ALL", df[1, ])
> print(places)
[1] 2 3 5
>
> # Each number corresponds to the order of the columns in the data frame and can be returned as such.
> nameCols <- names(df)[places]
> print(nameCols)
[1] "b" "c" "e"
>
> # Likewise, you can find what columns did not have "ALL" in it by doing the opposite.
> nameColsNOT <- names(df)[-places]
> print(nameColsNOT)
[1] "a" "d"
Iterate this method through a loop for each row in your data frame and use the conditional method I outlined above. Please note that this requires your columns to all be of "character" class, which I assume is the case already.
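To tie the pieces together, here is one possible sketch of the dynamic merge itself in base R. The source table src and its Value column are invented for illustration, filt holds only the first three rows of the example filter table, and the sketch assumes every filter row has at least one column that is not "ALL". Because the key columns are looked up by name, it does not depend on how many columns the filter table has.

```r
# Hypothetical filter table (first three example rows) and source data
filt <- data.frame(Tran = c("1001", "1002", "ALL"),
                   Acct = c("1", "ALL", "ALL"),
                   Co   = c("a", "ALL", "b"),
                   stringsAsFactors = FALSE)
src <- data.frame(Tran  = c("1001", "1001", "1002", "5555", "9999"),
                  Acct  = c("1", "2", "9", "7", "9"),
                  Co    = c("a", "a", "z", "b", "q"),
                  Value = c(10, 20, 30, 40, 50),
                  stringsAsFactors = FALSE)

# For each filter row, join only on the columns that are not "ALL"
matches <- lapply(seq_len(nrow(filt)), function(i) {
  keys <- names(filt)[unlist(filt[i, ]) != "ALL"]
  # merge() does an inner join by default; restore src's column order
  merge(src, filt[i, keys, drop = FALSE], by = keys)[names(src)]
})

# Stack the per-row results and drop rows matched by more than one filter row
result <- unique(do.call(rbind, matches))
```

With this data, result keeps the source rows with Value 10, 30 and 40. For a very large source table, one merge per filter row may be slow, so grouping the filter rows by their set of key columns first could be worth considering.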

Searching for values in one data frame using another

I have 2 data frames: dfA and dfB
I would like to be able to extract whole rows from dfB that meet criteria based on dfA
Example:
if (dfA$colA == dfB$colB) && (dfB$colC >= dfA$colD) &&
(dfB$colC <= dfA$colE) { print rows from dfB }
The values from the 1st column in dfA need to be an exact match for the 2nd column in dfB
AND
the values from column 3 in dfB need to fall within a range set by columns 4 and 5 in dfA.
The output should be the rows from dfB that meet these criteria.
I am not sure about R, but I guess it must be similar to pandas: create three boolean masks, one for each criterion, then combine them into an overall mask.
Example (pseudocode):
mask1 = (dfA$colA == dfB$colB) -> returns something like ( 0 0 1 1 0 1 0 1 ... ), where a "1" marks a matching entry in dfB.
mask2 = ...
mask3 = ...
overallMask = mask1 & mask2 & mask3
Then apply the overall mask to dfB and you should be done. The "1"s in the resulting mask mark the matching rows of dfB.
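In R, element-wise masks only line up when dfA and dfB have the same number of rows, so a join followed by a range filter may be a safer way to express the same idea. The sketch below uses made-up data; the column names follow the question, the id column is added only to make the result easy to read, and the column positions differ from the question's description.

```r
# Hypothetical data: dfA holds a key and a range, dfB holds the rows to search
dfA <- data.frame(colA = c("x", "y"),
                  colD = c(1, 10),
                  colE = c(5, 20),
                  stringsAsFactors = FALSE)
dfB <- data.frame(id   = 1:5,
                  colB = c("x", "x", "y", "y", "z"),
                  colC = c(3, 7, 15, 25, 4),
                  stringsAsFactors = FALSE)

# Exact match of dfB$colB against dfA$colA (inner join)
joined <- merge(dfB, dfA, by.x = "colB", by.y = "colA")

# Keep rows whose colC lies within [colD, colE]; return only dfB's columns
in_range <- joined$colC >= joined$colD & joined$colC <= joined$colE
hits <- joined[in_range, names(dfB)]
```

Here hits contains the dfB rows with id 1 and 3. If a key appears more than once in dfA, a dfB row can appear once per matching range.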

Resources