R: Update Column Based on Text Condition from Another Column - r

I would like to make a new column in my data frame by using a conditional statement that would say "If Column_y contains Column_x then 1 else 0"
For example:
Event  Name   Winner     Loser       New Column
1      James  James,Bob  John,Steve  1
1      Bob    James,Bob  John,Steve  1
1      John   James,Bob  John,Steve  0
1      Steve  James,Bob  John,Steve  0
I want to have New Column<- "If Winner contains Name then 1 else 0"
Keep in mind this is for 100,000 rows and probably 700 unique names. When I try things like
df$NewColumn<-ifelse(grepl(df$Name,df$Winner)==TRUE,1,0)
or variations I get the "pattern has a length > 1" error.

I think you just want to compare the Name column against the Winner column:
df$NewColumn <- ifelse(df$Name == df$Winner, 1, 0)
Note that because df$Name == df$Winner is already a logical expression, you might also be able to simplify to the line below, which gives TRUE/FALSE rather than 1/0:
df$NewColumn <- df$Name == df$Winner

In your example, exact string matching works. But I am assuming it does not hold true for your entire data.
Implementing the contains condition would be something like this:
library(dplyr)
library(purrr)
df <- df %>%
  dplyr::mutate(NewColumn = purrr::map2_dbl(.x = Winner, .y = Name, ~ ifelse(grepl(.y, .x), 1, 0)))
Adding an alternate solution with stringr:
library(stringr)
df <- df %>%
  dplyr::mutate(NewColumn = ifelse(str_detect(Winner, Name), 1, 0))
Let me know if this works.
P.S.: str_detect is faster.
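If the goal is to test whether Name is one of the comma-separated entries in Winner (rather than a substring or regular-expression match), a minimal base R sketch could look like this. It assumes the names are separated by a plain comma with no surrounding spaces; the small df only reproduces the example rows from the question, and winners is a helper name introduced for illustration.

```r
# Example rows from the question (assumed layout)
df <- data.frame(
  Name   = c("James", "Bob", "John", "Steve"),
  Winner = rep("James,Bob", 4),
  stringsAsFactors = FALSE
)

# Split each Winner string on commas: one character vector per row
winners <- strsplit(df$Winner, ",", fixed = TRUE)

# Row by row, check whether Name is an exact member of that row's winners
df$NewColumn <- as.integer(mapply(function(nm, w) nm %in% w, df$Name, winners))
df$NewColumn
```

Unlike grepl or str_detect, this does not treat "Bob" as contained in a longer name such as "Bobby", which may or may not be what you want.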

Related

R, filtering for an element in a list in a dataframe cell

Let's say I have a very simple data set. I have 2 columns, Parent Name, Children
> d = data.frame(Parents = c("Mark", "Adam"))
> d$Children = list(c("Kid1", "Kid2"), c("Kid3", "Kid4"))
> d
  Parents   Children
1    Mark Kid1, Kid2
2    Adam Kid3, Kid4
What I want to be able to do is search by Kid and get the parent name (and the index of that parent's name but this part is easy I presume). So "Kid1" would return "Mark". I can't figure how to do this.
I've tried using the following
which(d$Children = "Kid3")
But it didn't work, presumably because the datatype is actually list.
Is there a way to get around this? Is using a dataframe here a bad idea? Is there an alternate data structure I should use here? I think in Python I might have tried using a dictionary, but I'm not sure how to tackle this problem in R.
To filter on an element, use lapply with %in%. The Parents column is a factor here (the default for data.frame() before R 4.0.0), so it is converted to character while extracting:
as.character(d$Parents)[unlist(lapply(d$Children, `%in%`, x = 'Kid3'))]
#[1] "Adam"
Or with Map
as.character(d$Parents)[unlist(Map(`%in%`, "Kid3", d$Children))]
#[1] "Adam"
Or another option is stack with subset
subset(stack(setNames(d$Children, d$Parents)), values == "Kid3")$ind
Or with dplyr/purrr
library(purrr)
library(dplyr)
d %>%
  filter(map_lgl(Children, `%in%`, x = "Kid3")) %>%
  pull(Parents)
#[1] Adam
Or
library(tibble) # deframe() comes from tibble
deframe(d) %>%
  keep(~ "Kid3" %in% .x) %>%
  names
#[1] "Adam"
Here's a way with sapply from base R. sapply(d$Children, ...) applies the anonymous function function(x) "Kid3" %in% x to every element of d$Children. This function checks whether "Kid3" is present in each element and returns one logical value per row. This logical vector is then used to pick the corresponding parent. For more examples look at ?sapply.
d$Parents[sapply(d$Children, function(x) "Kid3" %in% x)]
[1] Adam
Levels: Adam Mark
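With dplyr and tidyr (unnest() comes from tidyr, and recent versions expect the cols argument) -
d %>% tidyr::unnest(cols = Children) %>% filter(Children == "Kid3")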
Parents Children
1 Adam Kid3
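Since the question mentions a Python dictionary, the closest base R analogue is probably a named vector. The sketch below builds one with children as names and parents as values; parent_of is a name introduced here for illustration, and the data frame is rebuilt with stringsAsFactors = FALSE so the lookup returns plain strings.

```r
# Rebuild the example data with character (not factor) parents
d <- data.frame(Parents = c("Mark", "Adam"), stringsAsFactors = FALSE)
d$Children <- list(c("Kid1", "Kid2"), c("Kid3", "Kid4"))

# Repeat each parent once per child, then name the entries by child
parent_of <- setNames(rep(d$Parents, lengths(d$Children)), unlist(d$Children))

parent_name <- parent_of[["Kid3"]]        # look up the parent by child name
parent_row  <- match(parent_name, d$Parents)  # row index of that parent in d
parent_name
parent_row
```

This assumes every child appears under only one parent; with duplicated child names, [[ would return only the first match.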

Merging multiple columns in a dataframe based on condition in R

I am very new to R, and I want to do the following:
I have a data frame that consists of ID, Col1, Col2, Col3 columns.
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text="
ID Col1 Col2 Col3
1 0 'Less than once a month' 0
2 Never 0 0
3 0 0 'Once a month'
")
I want to merge those 3 columns into one, where if there is "Never" and 0 in the other columns the value is "Never", if there is "Once a month" and the rest are 0, then "Once a month" and so on. All columns are mutually exclusive, meaning there cannot be "Never" and "Once a month" in the same row.
I tried to apply this loop:
for (val in df) {
if(df$Col1 == "Never" && df$Col2 == "0")
{
df$consolidated <- "Never"
} else (df$`Col1 == "0" && df$Col2 == "Less than once a month")
{
how_oft_purch_gr_pers$consolidated <- "Less than once a month"
}
}
I wanted to figure it out for two columns only first, but it didn't work, as all rows in the consolidated column are filled with "Less than once a month".
I want it to be like this:
ID Col1 Col2 Col3 Consolidated
1 0 Less than once a month 0 Less than once a month
2 Never 0 0 Never
3 0 0 Once a month Once a month
Any hint on what I am doing wrong?
Thank you in advance
You can think of using dplyr::coalesce after replacing 0 with NA. The coalesce() finds the first non-missing value (in a row in this case) and creates a new column. The solution can be as:
library(dplyr)
# funs() is deprecated in dplyr, so a formula lambda is used instead
df %>% mutate_at(vars(starts_with("Col")), ~ na_if(., "0")) %>%
  mutate(Consolidated = coalesce(Col1, Col2, Col3)) %>%
  select(ID, Consolidated)
# Or, more concisely (na_if is applied column by column):
bind_cols(df[1], Consolidated = coalesce(!!!lapply(df[-1], na_if, "0")))
# ID Consolidated
# 1 1 Less than once a month
# 2 2 Never
# 3 3 Once a month
Data:
df <- read.table(text =
"ID Col1 Col2 Col3
1 0 'Less than once a month' 0
2 Never 0 0
3 0 0 'Once a month'",
stringsAsFactors = FALSE, header = TRUE)
Even though @MKR has written a good answer, I want to point out a few errors in your code which might be the reason why it does not work.
for (val in df) {
You probably want to loop over all rows of df. However, you are in fact looping over the columns of your data frame. The reason is that a data frame is a list of vectors (your columns) which all must have the same length. With your code you iterate over the elements of df, which are the columns. See Q&A For each row in data.frame
if(df$Col1 == "Never" && df$Col2 == "0"){
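Note that the double && (unlike &) expects single logical values, not vectors. Older versions of R silently looked only at the first element of the vector you give it; since R 4.3.0 a condition of length greater than one is an error. Use the vectorized & for element-wise comparisons. See for example Q&A Boolean Operators && and ||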
df$consolidated <- "Never"
Here, you set the whole column consolidated of df to "Never", because you do not use the iteration variable val from above (and even if you did, it would stand for a column rather than a row, the way you wrote the loop).
} else (df$`Col1 == "0" && df$Col2 == "Less than once a month"){
You need to use else if(...), not else (...). As written, R treats the expression in (...) as the body of the else branch, to be evaluated when the if(...) above is not true. The {...} block that follows is then no longer part of the if... else... construct, so it is always executed, regardless of the outcome of the if(...) above.
Is df$`Col1 a typo? The backtick ` should only occur in pairs and can be used around variables (also column names).
df$consolidated <- "Less than once a month"
Here you again set a whole column to one value, like explained above.
}
}
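Putting these points together, the two-column check from the question can be written without a loop by using the vectorized & together with ifelse. This is only a sketch under the assumption that df is the data frame from the question; returning NA for rows that match neither condition is a choice made here, not something the question specifies.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID Col1 Col2 Col3
1 0 'Less than once a month' 0
2 Never 0 0
3 0 0 'Once a month'
")

# & compares element by element, so every row gets its own result
df$consolidated <- ifelse(
  df$Col1 == "Never" & df$Col2 == "0", "Never",
  ifelse(df$Col1 == "0" & df$Col2 == "Less than once a month",
         "Less than once a month", NA)
)
df$consolidated
```

Row 3 comes out as NA here because only Col1 and Col2 are inspected, which mirrors the two-column attempt in the question.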
This is a possibility using base R
Start your result column. Initialize it with only "0".
df$coalesced <- "0"
Loop over some columns of df (Col1--Col3). Use drop = FALSE in case you select only one column: R would otherwise return a vector, and for would loop over the elements of that vector instead of over the single column.
for (column in df[, c("Col1", "Col2", "Col3"), drop = FALSE]) {
This checks each element of coalesced to see whether it is already filled, and if not (if it is still "0") fills it with the value from the current column (which may also be "0")
  df$coalesced <- ifelse(df$coalesced == "0", column, df$coalesced)
}
Because the loop assigns to df$coalesced directly, the new column is already part of your data frame and no further step is needed.
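Assembled into one runnable piece, the base R steps above might look like this. The data frame is the one from the question; cols is a helper name introduced here for illustration.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID Col1 Col2 Col3
1 0 'Less than once a month' 0
2 Never 0 0
3 0 0 'Once a month'
")

cols <- c("Col1", "Col2", "Col3")  # columns to merge
df$coalesced <- "0"                # start with the placeholder value

# Iterating over a data frame visits its columns one at a time
for (column in df[, cols, drop = FALSE]) {
  # keep values that are already filled, otherwise take the current column
  df$coalesced <- ifelse(df$coalesced == "0", column, df$coalesced)
}
df$coalesced
```

This relies on the columns being mutually exclusive, as stated in the question; if two columns were filled in the same row, the leftmost one would win.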

how to strcount each element dataframe in r

I have a small problem printing and counting, one by one, how often the strings of one data frame appear in another.
I have a data frame called Pos_1 that contains strings like these:
Pos_1 = (morning bliss great happy)
and another data frame called Pos_2 that contains strings like these:
Pos_2 = (morning great)
What I want to do is count how often each string from Pos_1 appears in Pos_2.
I'm using str_count to count each string that appears:
for(h in 1:5)
Score=sum(str_count(Pos_2, Pos_1[h]))[1:length(Pos_1)]
The code above only returns the total over all string elements from Pos_1:
Text Score
morning 0
bliss 0
great 0
happy 0
The expected result of counting, with str_count, the elements of Pos_1 that match in Pos_2 is shown below.
I only need to produce the Score column:
Text Score
morning 1
bliss 0
great 1
happy 0
Is there any solution?
I think this does what you want:
library(stringr)
Score <- sapply(seq_along(unlist(Pos_1)), function(i)
  sum(str_count(unlist(Pos_2), unlist(Pos_1)[i])))
You use unlist to convert your data frames of strings into vectors. Then you use sapply to iterate str_count over the elements of the unlisted Pos_1, getting a vector in return.
If each element of Pos_1 will appear no more than once in Pos_2, you don't need str_count and could just use:
Score <- +(unlist(Pos_1) %in% unlist(Pos_2))
Try this:
library(stringr)
Pos_1 <- c("morning", "bliss", "great", "happy")
Pos_2 <- c("morning", "great")
df <- data.frame(Text = Pos_1, Score = unlist(lapply(Pos_1, function(x) sum(str_count(x, Pos_2)))))
df
Output:
Text Score
1 morning 1
2 bliss 0
3 great 1
4 happy 0
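Keep in mind that str_count treats its pattern as a regular expression and also counts partial matches inside longer strings. If whole-string matches are what you need, a base R sketch along these lines may be enough. It assumes Pos_1 and Pos_2 are plain character vectors, as in the answer above.

```r
Pos_1 <- c("morning", "bliss", "great", "happy")
Pos_2 <- c("morning", "great")

# For each word in Pos_1, count how many elements of Pos_2 equal it exactly
Score <- vapply(Pos_1, function(w) sum(Pos_2 == w), integer(1))

result <- data.frame(Text = Pos_1, Score = unname(Score), stringsAsFactors = FALSE)
result
```

vapply is used instead of sapply so the result is guaranteed to be an integer vector with one count per word.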

check if column contains part of another column in r

I have a dataframe with registration numbers in one column and correct registration number in another
a <- c("0c1234", "", "2468O")
b <- c("Oc1234", "Oc5678", "Oc9123")
df <- data.frame(a, b)
I wish to update row 1 as it was entered incorrectly, row 2 is blank so I would like to update the field. Row 3 has a different number, so I wish to keep this number, but make a new entry for this row (in another program, I just need to know that it needs to be inserted).
How do I produce this dataframe?
c <- c("update", "update", "insert")
df2 <- data.frame (a,b,c)
I have tried grepl and str_detect and also considered regular expressions with grepl (i.e. checking whether the 4-number combination in column a is in column b), but as yet I have been unsuccessful.
You can do this in this way:
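df <- data.frame(a, b, stringsAsFactors = FALSE)
df$c <- NA_character_ # create the result column before filling it row by row
for (i in seq_len(nrow(df))) {
  # a blank entry, or an approximate (fuzzy) match found by agrep(), means "update"
  if (df$a[i] == '' || length(agrep(df$a[i], df$b[i])) > 0) {
    df$c[i] <- 'update'
  } else {
    df$c[i] <- 'insert'
  }
}
df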
## a b c
##1 0c1234 Oc1234 update
##2 Oc5678 update
##3 2468O Oc9123 insert
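The same rule can be sketched without an explicit for loop. agrepl() accepts only a single pattern, so the rows are paired up with mapply(); close_match is a name introduced here for illustration, and the default fuzziness of agrepl() (max.distance = 0.1) is assumed to be acceptable for your data.

```r
a <- c("0c1234", "", "2468O")
b <- c("Oc1234", "Oc5678", "Oc9123")
df <- data.frame(a, b, stringsAsFactors = FALSE)

# TRUE where a is non-empty and approximately matches b in the same row
close_match <- mapply(
  function(x, y) nzchar(x) && agrepl(x, y),
  df$a, df$b,
  USE.NAMES = FALSE
)

# blank entries and near matches are updates, everything else is an insert
df$c <- ifelse(df$a == "" | close_match, "update", "insert")
df
```

The nzchar() guard keeps the empty string from being passed to agrepl() as a pattern.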
You can do something like this:
df$c <- ifelse(df$a == '', 'update', 'insert')
Your output will be as follows (note that row 1 comes out as 'insert' here, whereas the desired df2 in your question has 'update' for that row):
a b c
1 0c1234 Oc1234 insert
2 Oc5678 update
3 2468O Oc9123 insert
This will only work, of course, if your original data frame has 'transactions' in proper order.

removing duplicate subsets of rows

I have a list of stocks in an index sorted by date, and I'm trying to remove all rows in which the previous row has the same stock code. This will give a dataframe of the initial index and all dates that there was a change to the index
In my working example, I'll use names instead of the date column, and some numbers.
At first, I thought I could remove the rows by using subset() and !duplicated
name <- c("Joe","Mary","Sue","Frank","Carol","Bob","Kate","Jay")
num <- c(1,2,2,1,2,2,2,3)
num2 <- c(1,1,1,1,1,1,1,1)
df <- data.frame(name,num,num2)
dfnew <- subset(df, !duplicated(df[,2]))
However, this might not work in the case where a stock is removed from the list and then later replaced. So, in my working example, the desired output are the rows of Joe, Mary, Frank, Carol and Jay.
Next I created a function to tell if the index changes. The input of the function is row number:
#------ function to tell if there is a change in the row subset-----#
df2 <- as.matrix(df)
ChangeDay <- function(x){
Current <- df2[x,2:3]
Prev <- df2[x-1,2:3]
if (length(Current) != length(Prev))
NewList <- true
else
NewList <- length(which(Current==Prev))!=length(Current)
return(NewList)
}
Finally, I attempt to create a loop to remove the desired rows. I'm new to programming, and I struggle with loops. I'm not sure what the best way is to pre-allocate memory when the dimensions of my final output is unknown. All the books I've looked at only give trivial loop examples. Here is my latest attempt:
result <- matrix(data=NA,nrow=nrow(df2),ncol=3) #pre allocate memory
tmp <- as.numeric(df2) #store the original data
changes <- 1
for (i in 2:nrow(df2)){ #always keep row 1, thus the loop starts at row 2
if(ChangeDay(i)==TRUE){
result[i,] <-tmp[i] #store the row in result if ChangeDay(i)==TRUE
changes <- changes + 1 #increment counter
}
}
result <- result[1:changes,]
Thanks for your help, and any additional general advice on loops is appreciated!
It is not entirely clear what you want to do, but I guess you want to keep every row whose num differs from the previous row's:
df[c(1,diff(df$num)) !=0,]
   name num num2
1   Joe   1    1
2  Mary   2    1
4 Frank   1    1
5 Carol   2    1
8   Jay   3    1
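diff() only works on a single numeric column. If a row should count as changed when any of several columns differs from the previous row, one possible sketch is to paste those columns into a key and compare each key with the one before it. key and changed are names introduced here for illustration, and the approach assumes the pasted values cannot collide (for example, the columns themselves contain no spaces).

```r
name <- c("Joe", "Mary", "Sue", "Frank", "Carol", "Bob", "Kate", "Jay")
num  <- c(1, 2, 2, 1, 2, 2, 2, 3)
num2 <- c(1, 1, 1, 1, 1, 1, 1, 1)
df <- data.frame(name, num, num2, stringsAsFactors = FALSE)

# One string per row built from the columns that define "the same row"
key <- paste(df$num, df$num2)

# Always keep row 1, then keep rows whose key differs from the previous row's
changed <- c(TRUE, key[-1] != key[-length(key)])
dfnew <- df[changed, ]
dfnew
```

No loop or pre-allocation is needed because the comparison is done on whole vectors at once.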
