Compare two dataframe columns and add them to the dataframe - r

I have a dataframe with two columns. I want to add a new column to df that shows where the values in the second column match those in the first column.
I tried:
df<-data.frame(A=c("1","test","2","3",NA,"Test", NA),B=c("1","No Match","No Match","3",NA,"Test", "No Match"))
df[df$A == df$B ]
However, I get:
Error in Ops.factor(df$A, df$B) : level sets of factors are different
Any recommendation on what I am doing wrong?

Deal with the NAs first and then add your column:
> df[is.na(df)]=""
> df$New = with(df, A==B)
> df
     A        B   New
1    1        1  TRUE
2 test No Match FALSE
3    2 No Match FALSE
4    3        3  TRUE
5                TRUE
6 Test     Test  TRUE
7      No Match FALSE
Or remove NA from your initial data.frame with df = df[complete.cases(df),] and then add the column.
If you really want FALSE whenever there is an NA in column A or B:
> transform(df, New=ifelse(is.na(A)|is.na(B), FALSE, df$A==df$B))
A B New
1 1 1 TRUE
2 test No Match FALSE
3 2 No Match FALSE
4 3 3 TRUE
5 <NA> <NA> FALSE
6 Test Test TRUE
7 <NA> No Match FALSE
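A minimal sketch of the same idea in base R, assuming the columns are character rather than factor. The Ops.factor error in the question comes from comparing two factors with different level sets; data.frame() converted strings to factors by default before R 4.0.0, so stringsAsFactors = FALSE is passed explicitly here.

```r
# Rebuild the example with character columns (assumption: factors are not needed)
df <- data.frame(
  A = c("1", "test", "2", "3", NA, "Test", NA),
  B = c("1", "No Match", "No Match", "3", NA, "Test", "No Match"),
  stringsAsFactors = FALSE
)

# TRUE only where both values are present and equal; any NA gives FALSE
df$New <- !is.na(df$A) & !is.na(df$B) & df$A == df$B
df
```

This matches the transform()/ifelse() variant above: rows with an NA in either column end up FALSE rather than NA.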

Related

Roll condition ifelse in R data frame

I have a data frame with two columns in R and I want to create a third column that will roll by 2 in both columns and check if a condition is satisfied or not as described in the table below.
The condition is a rolling ifelse and goes like this :
IF -A1<B3<A1 TRUE ELSE FALSE
IF -A2<B4<A2 TRUE ELSE FALSE
IF -A3<B5<A3 TRUE ELSE FALSE
IF -A4<B6<A4 TRUE ELSE FALSE
A   B    CHECK
1   4    NA
2   5    NA
3   6    FALSE
4   1    TRUE
5   -4   FALSE
6   1    TRUE
How can I do it in R? Is there a base R function for this, or a way to do it within the dplyr framework?
Since R is vectorized, you can do that with one command, using for instance dplyr::lag:
library(dplyr)
df %>%
mutate(CHECK = -lag(A, n=2) < B & lag(A, n=2) > B)
A B CHECK
1 1 4 NA
2 2 5 NA
3 3 6 FALSE
4 4 1 TRUE
5 5 -4 FALSE
6 6 1 TRUE
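For the base R part of the question, one possible sketch builds the lag by hand. It assumes the six-row data frame from the question; A_lag2 is just an illustrative helper name.

```r
# Example data from the question
df <- data.frame(A = 1:6, B = c(4, 5, 6, 1, -4, 1))

# Shift A down by two rows, padding the first two positions with NA
A_lag2 <- c(NA, NA, head(df$A, -2))

# -A[i-2] < B[i] < A[i-2], evaluated for all rows at once
df$CHECK <- -A_lag2 < df$B & df$B < A_lag2
df
```

The first two rows have no value two rows back, so both comparisons are NA and CHECK stays NA there, as in the dplyr output.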

How to add dummy variables to data with specific characteristic

My question is probably quite basic but I've been struggling with it so I'd be really grateful if someone could offer a solution.
I have data in the following format:
ORG_NAME   var_1_12   var_1_13   var_1_14
A          12         11         5
B          13         13         11
C          6          7          NA
D          NA         NA         5
I have data on organizations over 5 years, but over that time, some organizations have merged and others have disappeared. I'm planning on conducting a fixed-effects regression, so I need to add a dummy variable which is "0" when organizations have remained the same (in this case row A and row B), and "1" in the year before the merge, and after the merge. In this case, I know that orgs C and D merged, so I would like for the data to look like this:
ORG_NAME   var_1_12   dum_12   var_1_13   dum_13
A          12         0        5          0
B          13         0        11         0
C          6          1        NA         1
D          NA         1        5          1
How would I code this?
This approach (as is any, according to your description) is absolutely dependent on the companies being in consecutive rows.
mtx <- apply(is.na(dat[,-1]), MARGIN = 2,
function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]), fill = FALSE))
mtx
# var_1_12 var_1_13 var_1_14
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE
# [3,] TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE
out <- rowSums(mtx) == ncol(mtx)
out
# [1] FALSE FALSE TRUE FALSE
out | c(FALSE, out[-length(out)])
# [1] FALSE FALSE TRUE TRUE
### and your 0/1 numbers, if logical isn't right for you
+(out | c(FALSE, out[-length(out)]))
# [1] 0 0 1 1
Brief walk-through:
is.na(dat[,-1]) returns a matrix of whether the values (except the first column) are NA; because it's a matrix, we use apply to call a function on each column (using MARGIN=2);
zoo::rollapply is a function that does rolling calculations on a portion ("window") of the vector at a time, in this case 2-wide. For example, if we have 1:5, then it first looks at c(1,2), then c(2,3), then c(3,4), etc.
xor is an eXclusive OR, meaning it is true when exactly one of its arguments is true and the other is false;
mtx is a matrix indicating that a cell and the one below it met the conditions (one is NA, the other is not). We then check to see which of these rows are all true, forming out.
since we need a 1 in both rows, we vector-OR (|) out with itself, shifted down by one row, to produce your intended output.
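The walk-through above can also be sketched without zoo, under the same assumption that merging organizations sit in consecutive rows. dat is rebuilt here from the table in the question, and na, pair and dummy are illustrative names, not part of the original answer.

```r
# Data as shown in the question
dat <- data.frame(
  ORG_NAME = c("A", "B", "C", "D"),
  var_1_12 = c(12, 13, 6, NA),
  var_1_13 = c(11, 13, 7, NA),
  var_1_14 = c(5, 11, NA, 5)
)

na <- is.na(dat[, -1])

# xor each row with the row below it; the last row has no partner, so pad with FALSE
pair <- rbind(xor(na[-nrow(na), , drop = FALSE], na[-1, , drop = FALSE]), FALSE)

# A row starts a merged pair if every year has exactly one NA across the two rows
out <- rowSums(pair) == ncol(pair)

# Flag both rows of each pair, then convert the logical vector to 0/1
dummy <- +(out | c(FALSE, out[-length(out)]))
dummy
```

If the zoo version is already working for you there is no need to switch; this only avoids the extra dependency.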
If I understand correctly, you want to code rows with at least one NA as "1". If so, you just need one dummy variable for all the years, right? Something like this:
set.seed(4)
df <- data.frame(org=as.factor(LETTERS[1:5]),y1=sample(c(1:4,NA),5),y2=sample(c(3:6,NA),5),y3=sample(c(2:5,NA),5))
df$dummy <- as.numeric(apply(df, 1, function(x)any(is.na(x))))
which gives you
org y1 y2 y3 dummy
1 A 3 5 3 0
2 B NA 4 5 1
3 C 4 3 2 0
4 D 1 6 NA 1
5 E 2 NA 4 1

Finding index of vector from two other vectors

I am trying to index a vector using two other vectors. I can index from the first vector with no issue, but my conditional index from the second value has been causing issues.
My example code is as follows
h=rep(seq(1:24),90)
m=rep(1:3,each=24*30)
d=rep(1:30,each=24*3)
My goal is to use match (or some other function if better suited) to determine which hour values correspond to all of the instances in which m=1, or in which m=2 and d=2 or 3.
My attempt is as follows
hin=match(m,1)|(match(m,2)&match(d,2:3))
In this case elements 1:720 and 745:793 should result in TRUE; however, only 1:720 are TRUE. How do I execute the second portion of the above argument so the later values are identified?
EDIT
To create a more reproducible example:
h2=rep(seq(1:5),4)
d2=rep(rep(1:2,each=5),2)
m2=rep(1:2,each=10)
Goal is to create a logical vector containing 10 TRUE, 5 FALSE, 5 TRUE
(m=1 or (m=2 and d=2))
Eventually using this logical vector will create a new h2 removing elements 11:15
Goal outputs:
hin2=c(rep("TRUE",10),rep("FALSE",5),rep("TRUE",5))
h2new=c(rep(seq(1:5),2),rep(NA,5),seq(1:5))
hin2
[1] "TRUE" "TRUE" "TRUE" "TRUE" "TRUE" "TRUE" "TRUE" "TRUE" "TRUE" "TRUE" "FALSE"
[12] "FALSE" "FALSE" "FALSE" "FALSE" "TRUE" "TRUE" "TRUE" "TRUE" "TRUE"
h2new
[1] 1 2 3 4 5 1 2 3 4 5 NA NA NA NA NA 1 2 3 4 5
Using the logical operators literally gives the logical vector hin2. Then just replace the elements where hin2 is FALSE with NA.
hin2 <- m2 %in% 1 | (m2 %in% 2 & d2 %in% 2)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
# [13] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
h2new <- replace(h2, !hin2, NA)
# [1] 1 2 3 4 5 1 2 3 4 5 NA NA NA NA NA 1 2 3 4 5
To select the values, do:
h_new <- h2[!is.na(h2new)]
# [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
You could replace the h2 values which do not satisfy the condition with NA:
h2[!((m2 == 1) | (m2 == 2 & d2 == 2))] <- NA
h2
#[1] 1 2 3 4 5 1 2 3 4 5 NA NA NA NA NA 1 2 3 4 5
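To see why %in% (or ==) is the safer choice here, a small sketch on the reproducible example may help. match() returns a position or NA rather than TRUE/FALSE, so non-matches propagate as NA through | and &; via_match and via_in are illustrative names.

```r
h2 <- rep(1:5, 4)
d2 <- rep(rep(1:2, each = 5), 2)
m2 <- rep(1:2, each = 10)

# match() gives 1 for a hit and NA for a miss, so misses stay NA
via_match <- match(m2, 1) | (match(m2, 2) & match(d2, 2))

# %in% always gives TRUE or FALSE
via_in <- m2 %in% 1 | (m2 %in% 2 & d2 %in% 2)

via_match[11:15]  # NA where the condition does not hold
via_in[11:15]     # FALSE where the condition does not hold
```

An NA in a logical index selects an NA element instead of dropping it, which is usually not what you want when subsetting h2.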

Create a true/false variable in R

I have one variable column that contains large string values which are multiple words. I want to create a True/False column which reports true if a certain value is detected within the column of interest.
I have tried a mutate function with an embedded str_detect.
Dataset <- Dataset %>%
mutate(new_column = str_detect('column.of.interest', "abcd"))
I expected every row in which my column of interest contained "abcd" to report TRUE in my new column. However, every row reports FALSE in my new column.
Base R version. First create a sample data set (questioner: you should have done this; answerers: you should always do this):
> Dataset = data.frame(ID=1:10, column.of.interest=c(NA,"This","abcd","Foo","the abcde",NA,"Me","my","mo","END"))
which looks like this:
> Dataset
ID column.of.interest
1 1 <NA>
2 2 This
3 3 abcd
4 4 Foo
5 5 the abcde
6 6 <NA>
7 7 Me
8 8 my
9 9 mo
10 10 END
Then do:
> Dataset$new_column <- grepl("abcd", Dataset$column.of.interest, ignore.case = T)
to get:
> Dataset
ID column.of.interest new_column
1 1 <NA> FALSE
2 2 This FALSE
3 3 abcd TRUE
4 4 Foo FALSE
5 5 the abcde TRUE
6 6 <NA> FALSE
7 7 Me FALSE
8 8 my FALSE
9 9 mo FALSE
10 10 END FALSE
You may or may not want ignore.case.
Here is one answer based on a dataset from ggplot2:
library(ggplot2)
library(dplyr)
library(stringr)  # provides str_detect
diamonds %>% mutate(newCol = str_detect(clarity, "1"))
Original bad version of answer (see comments for why the above is better)
diamonds %>% mutate(newCol = ifelse(str_detect(clarity, "1"), "TRUE", "FALSE"))
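As for why the attempt in the question returned FALSE everywhere: the column name was passed in quotes, so the pattern was searched for in the literal string "column.of.interest" rather than in the column's values. A short base R sketch of the difference, using grepl as a stand-in for str_detect and a made-up four-row Dataset:

```r
# Hypothetical sample data, not the asker's real dataset
Dataset <- data.frame(
  ID = 1:4,
  column.of.interest = c(NA, "This", "abcd", "the abcde")
)

# Quoted: a single string that does not contain "abcd", so the result is FALSE
quoted <- grepl("abcd", "column.of.interest")

# Unquoted column reference: one result per row
unquoted <- grepl("abcd", Dataset$column.of.interest)

quoted
unquoted
```

Inside mutate() the single FALSE is recycled to every row, which matches what the question describes. Dropping the quotes, as in mutate(new_column = str_detect(column.of.interest, "abcd")), should give the row-wise result.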

Group index from column labeling the last element in each group

I'm trying to subset a data frame. The data frame is to be broken into subsets, where the last element in each subset has a "TRUE" value in the "bool" column. Consider the following data frame:
df <- data.frame(c(3,1,3,4,1,1,4), rnorm(7))
df <- cbind(df, df[,1] != 1)
names(df) <- c("ind", "var", "bool")
df
# ind var bool
# 1 3 0.02343906 TRUE
# 2 1 0.94786193 FALSE
# 3 3 0.50632766 TRUE
# 4 4 0.24655548 TRUE
# 5 1 -1.58103304 FALSE
# 6 1 0.73999468 FALSE
# 7 4 0.10929906 TRUE
Row 1 should be a subset, rows 2 and 3 should be a subset, row 4 a subset and then rows 5 through 7 a subset. The code I have below works (I can subset on the new column), but I was wondering if there was a more "R" way of doing it.
index = 1
for (i in 1:nrow(df)) {
  if (df$bool[i]) {
    df$index[i] = index
    index = index + 1
  } else {
    df$index[i] = index
  }
}
df
# ind var bool index
# 1 3 0.02343906 TRUE 1
# 2 1 0.94786193 FALSE 2
# 3 3 0.50632766 TRUE 2
# 4 4 0.24655548 TRUE 3
# 5 1 -1.58103304 FALSE 4
# 6 1 0.73999468 FALSE 4
# 7 4 0.10929906 TRUE 4
The first thought I would have would be to use the cumulative sum (cumsum) on the bool column to get the group indices -- this will increase the index value by 1 every time the bool value is TRUE:
df$index <- cumsum(df$bool)
df
# ind var bool index
# 1 3 -1.0712125 TRUE 1
# 2 1 0.4994369 FALSE 1
# 3 3 2.1335274 TRUE 2
# 4 4 -1.5950432 TRUE 3
# 5 1 0.5919880 FALSE 3
# 6 1 2.7039831 FALSE 3
# 7 4 -1.3526646 TRUE 4
This is not quite right because all the observations before the TRUE of each group are assigned to the previous group. We can fix that by adding 1 for all the observations with bool set to FALSE:
df$index <- cumsum(df$bool) + !df$bool
df
# ind var bool index
# 1 3 -1.0712125 TRUE 1
# 2 1 0.4994369 FALSE 2
# 3 3 2.1335274 TRUE 2
# 4 4 -1.5950432 TRUE 3
# 5 1 0.5919880 FALSE 4
# 6 1 2.7039831 FALSE 4
# 7 4 -1.3526646 TRUE 4
Splitting the data frame into a list of subsets can now be achieved efficiently with subsets <- split(df, df$index).
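Another way to look at it, offered only as an alternative sketch: a new group starts at row 1 and at every row that follows a TRUE, so the cumulative sum of those "start" flags gives the same index. The starts and idx names are illustrative, and set.seed is added only to make the example repeatable.

```r
set.seed(1)
df <- data.frame(ind = c(3, 1, 3, 4, 1, 1, 4), var = rnorm(7))
df$bool <- df$ind != 1

# A group starts at the first row and right after each TRUE
starts <- c(TRUE, head(df$bool, -1))
idx <- cumsum(starts)

df$index <- idx
subsets <- split(df, df$index)
length(subsets)
```

Both formulas agree on this example; which one reads better is mostly a matter of taste.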

Resources