I'm trying to apply the same condition to multiple columns of a data frame and then create a new column that flags whether any of those columns meets the condition.
I can do it manually with an OR statement, but I was wondering if there is an easy way to apply it to more columns.
An example:
data <- data.frame(V1=c("A","B"),V2=c("A","A","A","B","B","B"),V3=c("A","A","B","B","A","A"))
data[4] <- ifelse((data[1]=="A"|data[2]=="A"|data[3]=="A"),1,0)
So the 4th row is the only one where none of the columns meets the condition:
V1 V2 V3 V1
1 A A A 1
2 B A A 1
3 A A B 1
4 B B B 0
5 A B A 1
6 B B A 1
Do you know a shorter way to apply the condition?
I tried something like
data[4] <- ifelse(any(data[,c(1:3)]=="A"),1,0)
but it evaluates the condition over the whole dataset instead of row by row, so every row is given 1.
We can use Reduce with lapply
data$NewCol <- +( Reduce(`|`, lapply(data, `==`, 'A')))
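A minimal sketch of the same idea restricted to named columns, which avoids also testing a result column that has already been added to data. The cols and hits names are introduced here for illustration only.

```r
data <- data.frame(V1 = rep(c("A", "B"), 3),
                   V2 = c("A", "A", "A", "B", "B", "B"),
                   V3 = c("A", "A", "B", "B", "A", "A"))

cols <- c("V1", "V2", "V3")             # hypothetical selection of columns to test
hits <- lapply(data[cols], `==`, "A")   # one logical vector per selected column
data$NewCol <- +Reduce(`|`, hits)       # element-wise OR across columns, TRUE/FALSE -> 1/0
data$NewCol
# [1] 1 1 1 0 1 1
```

Adding more columns then only means extending cols.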
We can use apply row-wise :
data$ans <- +(apply(data[1:3] == "A", 1, any))
data
# V1 V2 V3 ans
#1 A A A 1
#2 B A A 1
#3 A A B 1
#4 B B B 0
#5 A B A 1
#6 B B A 1
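To see why any() on its own gives 1 everywhere while apply(..., 1, any) does not, here is a small sketch on the question's data. The names m, whole and by_row are made up for the illustration.

```r
data <- data.frame(V1 = rep(c("A", "B"), 3),
                   V2 = c("A", "A", "A", "B", "B", "B"),
                   V3 = c("A", "A", "B", "B", "A", "A"))

m <- data[, 1:3] == "A"       # 6 x 3 logical matrix
whole <- any(m)               # collapses the whole matrix to a single TRUE
by_row <- apply(m, 1, any)    # one TRUE/FALSE per row

length(whole)                 # 1, so ifelse() recycles the same answer to every row
unname(by_row)
# [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
```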
Try:
data$V4 <- +(rowSums(data == 'A') > 0)
Output:
V1 V2 V3 V4
1 A A A 1
2 B A A 1
3 A A B 1
4 B B B 0
5 A B A 1
6 B B A 1
I have a tricky merge that I usually do in Excel via various formulas and I want to automate with R.
I have 2 dataframes, one called inputs looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another called df
id v
1 1
1 2
1 3
2 2
3 1
I would like to combine them based on the id and v values such that I get
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then on one of the columns v1 through v3. In the first row of the result, I match id = 1 and column v1, since the value of v equals 1. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The example data frames are simplified versions: I have more records, and the columns go from v1 up to v50.
Thanks!
You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>% pivot_longer(!id,names_prefix='v',names_to = 'v') %>%
mutate(v=as.numeric(v)) %>%
inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)
You can use two-column matrices as index arguments to "[", so this is a one-liner. (Note that the names of the data objects are d1 and d2. I'm opposed to using df as a data object name.)
d1[-1][ data.matrix(d2)] # returns [1] "A" "A" "C" "D" "T"
So full solution is:
cbind( d2, key= d1[-1][ data.matrix(d2)] )
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
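The one-liner relies on id being equal to the row number of d1. A minimal sketch, assuming that may not always hold, looks up the row position with match() first. The names vals and idx are introduced here for illustration.

```r
d1 <- read.table(text = "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F", header = TRUE)

d2 <- read.table(text = "
id v
1 1
1 2
1 3
2 2
3 1", header = TRUE)

vals <- as.matrix(d1[-1])                  # character matrix holding v1..v3
idx <- cbind(match(d2$id, d1$id), d2$v)    # row = position of id in d1, column = v
d2$key <- vals[idx]                        # two-column matrix picks one cell per row of idx
d2$key
# [1] "A" "A" "C" "D" "T"
```

Each row of idx is read as a (row, column) pair, which is what makes the lookup work without a loop.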
Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- c()
for (i in 1:nrow(df2)) {
key <- append(df[df2$id[i],(df2$v[i] + 1L)] , key)
}
df2$key <- rev(key)
df2
#> id v key
#> 1 1 1 A
#> 2 1 2 A
#> 3 1 3 C
#> 4 2 2 D
#> 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)
I want to check, row by row, whether the values in two data frames are the same.
In this case, the duplicated and all_equal functions are not suitable.
Reproducible Sample Data
df1 <- data.frame(a=c(1,2,3),b=c(4,5,6))
df2 <- data.frame(a=c(1,2,4),b=c(4,5,6))
> df1
a b
1 1 4
2 2 5
3 3 6
> df2
a b
1 1 4
2 2 5
3 4 6
Expected output
final <- data.frame(a=c(1,2,4),b=c(4,5,6),c=c('T','T','F'))
#c column is the result I need. whether the values in each row are the same.
>final
a b c
1 1 4 T
2 2 5 T
3 4 6 F
I tried the method below, but it is complicated.
#1. making idx of df1, df2
#2. and full_join
#3. and left_join df1
#4. and left_join df2
df1$idx1 <- 1:nrow(df1)
df2$idx2 <- 1:nrow(df2)
df3<-full_join(df1,df2,by=c('a','b'))
df3<-left_join(df3,df1,by=c('a','b'))
df3<-left_join(df3,df2,by=c('a','b')) #This may or may not work..
I think there must be a better way. Help!
We could use
df2$c <- Reduce(`&`, Map(`==`, df1, df2))
-output
> df2
a b c
1 1 4 TRUE
2 2 5 TRUE
3 4 6 FALSE
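For anyone unsure what the two steps do, here is a small sketch that splits them apart. The names per_col and same are introduced for the illustration.

```r
df1 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
df2 <- data.frame(a = c(1, 2, 4), b = c(4, 5, 6))

per_col <- Map(`==`, df1, df2)   # compares column a with a, b with b; one logical vector each
same <- Reduce(`&`, per_col)     # a row counts as equal only if every column matched
same
# [1]  TRUE  TRUE FALSE
```

This assumes both data frames have the same columns in the same order, since Map pairs them by position.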
You can get column 'c' by:
c <- df1$a == df2$a & df1$b == df2$b
This gives TRUE TRUE FALSE. It looks like you then want to bind this to df2, so
cbind.data.frame(df2, c)
You may use rowSums -
final <- df2
final$c <- rowSums(df1 != df2) == 0
final
# a b c
#1 1 4 TRUE
#2 2 5 TRUE
#3 4 6 FALSE
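If the 'T'/'F' labels from the expected output are wanted rather than TRUE/FALSE, a minimal sketch along the same lines wraps the comparison in ifelse():

```r
df1 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
df2 <- data.frame(a = c(1, 2, 4), b = c(4, 5, 6))

final <- df2
# rowSums counts the mismatching cells per row; zero mismatches means the rows agree
final$c <- ifelse(rowSums(df1 != df2) == 0, "T", "F")
final$c
# [1] "T" "T" "F"
```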
In case the positions of the rows in each data.frame do not matter you can use merge.
within(merge(df2, within(df1, c <- "T"), all.x=TRUE), c[is.na(c)] <- "F")
# a b c
#1 1 4 T
#2 2 5 T
#3 4 6 F
or using duplicated.
df2$c <- c("F", "T")[1+tail(duplicated(rbind(df1, df2)), nrow(df2))]
df2
# a b c
#1 1 4 T
#2 2 5 T
#3 4 6 F
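The duplicated one-liner is dense, so here is a sketch of it unpacked step by step. The intermediate names stacked, dup and from_df2 are made up for the illustration.

```r
df1 <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
df2 <- data.frame(a = c(1, 2, 4), b = c(4, 5, 6))

stacked <- rbind(df1, df2)           # df1 rows first, then df2 rows
dup <- duplicated(stacked)           # TRUE where a row repeats an earlier one
from_df2 <- tail(dup, nrow(df2))     # keep only the flags belonging to df2
res <- c("F", "T")[1 + from_df2]     # FALSE -> index 1 ("F"), TRUE -> index 2 ("T")
res
# [1] "T" "T" "F"
```

One caveat: a row that appears twice within df2 itself would also be flagged on its second occurrence, even if it is absent from df1.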
I want to check whether the set of values in each row of one data frame also appears as a row in another data frame.
In this case, the duplicated and all_equal functions are not suitable.
Reproducible Sample Data
a=c(1,1)
b=c(2,2)
c=c(3,3)
d=c(4,5)
df1<-rbind(a,b,c)
df1<-df1 %>% as.data.frame()
df2<-rbind(a,d,b)
df2<-df2 %>% as.data.frame()
> df1
V1 V2
a 1 1
b 2 2
c 3 3
> df2
V1 V2
a 1 1
d 4 5
b 2 2
Expected output
df1$idx1 <- 1:nrow(df1)
df2$idx2 <- 1:nrow(df2)
df1
df2
df3<-full_join(df1,df2,by=c('V1','V2'))
df3
df3$need <- ifelse(is.na(df3$idx2), 'only_df1',
ifelse(is.na(df3$idx1), 'only_df2',
'duplicated'))
> df3
V1 V2 idx1 idx2 need
1 1 1 1 1 duplicated
2 2 2 2 3 duplicated
3 3 3 3 NA only_df1
4 4 5 NA 2 only_df2
I tried this, but it is complicated.
I think there must be a better way. Help!
Since you are already using dplyr, you may use case_when, which is easier to read and write, especially when you have a lot of conditions.
library(dplyr)
full_join(df1,df2,by=c('V1','V2')) %>%
mutate(need = case_when(is.na(idx2) ~ 'only_df1',
is.na(idx1) ~ 'only_df2',
TRUE ~ 'duplicated'))
# V1 V2 idx1 idx2 need
#1 1 1 1 1 duplicated
#2 2 2 2 3 duplicated
#3 3 3 3 NA only_df1
#4 4 5 NA 2 only_df2
As already mentioned in the comments, your way looks ok. In case you want to see how it could be done in base:
a <- c(1,1)
b <- c(2,2)
c <- c(3,3) #Better to avoid names of existing functions such as c
d <- c(4,5)
df1 <- as.data.frame(rbind(a,b,c))
df2 <- as.data.frame(rbind(a,d,b))
df1$idx1 <- seq_len(nrow(df1)) #seq_len will also work in case nrow=0
df2$idx2 <- seq_len(nrow(df2))
df3 <- merge(df1, df2, all=TRUE)
df3$need <- ifelse(is.na(df3$idx2), "only_df1",
ifelse(is.na(df3$idx1), "only_df2",
"duplicated"))
df3
# V1 V2 idx1 idx2 need
#1 1 1 1 1 duplicated
#2 2 2 2 3 duplicated
#3 3 3 3 NA only_df1
#4 4 5 NA 2 only_df2
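If the idx columns are not needed in the result, a rough alternative sketch builds one key string per row and compares the keys with %in%. This is a different technique from the merge above. It assumes the "_" separator never occurs inside the values, and the names k1, k2 and all_keys are made up for the illustration.

```r
df1 <- data.frame(V1 = c(1, 2, 3), V2 = c(1, 2, 3))
df2 <- data.frame(V1 = c(1, 4, 2), V2 = c(1, 5, 2))

k1 <- paste(df1$V1, df1$V2, sep = "_")   # one key per row of df1
k2 <- paste(df2$V1, df2$V2, sep = "_")   # one key per row of df2
all_keys <- union(k1, k2)                # every distinct row, df1 order first

need <- ifelse(!all_keys %in% k2, "only_df1",
        ifelse(!all_keys %in% k1, "only_df2", "duplicated"))
setNames(need, all_keys)
#          1_1          2_2          3_3          4_5
# "duplicated" "duplicated"   "only_df1"   "only_df2"
```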
We can use
library(arsenal)
summary(comparedf(df1, df2))
I have a data frame as below:
dat <- data.frame(
V1=c("A","B","C","A"),
V2=c("B","C","D","B"),
V3=c("C","D","","C"),
V4=c("D","","","E")
)
V1 V2 V3 V4
1 A B C D
2 B C D
3 C D
4 A B C E
The values of rows 2 and 3 also appear in row 1, just shifted to different columns. How can I filter out rows 2 and 3 so that I am left with rows 1 and 4 only?
Paste together each row. Go through the values and check if it (partially) matches any of the values using grepl.
S = trimws(Reduce(paste, dat), "right")
dat[sapply(1:length(S), function(i) !any(grepl(S[i], S[-i]))),]
# V1 V2 V3 V4
#1 A B C D
#4 A B C E
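One thing to keep in mind is that grepl treats S[i] as a regular expression by default. A minimal sketch of the same approach with fixed = TRUE, assuming the cells could contain characters such as "." or "(", and with vapply and seq_along swapped in for sapply and 1:length(S):

```r
dat <- data.frame(
  V1 = c("A", "B", "C", "A"),
  V2 = c("B", "C", "D", "B"),
  V3 = c("C", "D", "", "C"),
  V4 = c("D", "", "", "E")
)

S <- trimws(Reduce(paste, dat), "right")   # "A B C D" "B C D" "C D" "A B C E"
# keep a row only if its pasted string is not contained in any other row's string
keep <- vapply(seq_along(S),
               function(i) !any(grepl(S[i], S[-i], fixed = TRUE)),
               logical(1))
dat[keep, ]
#   V1 V2 V3 V4
# 1  A  B  C  D
# 4  A  B  C  E
```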
I have a dataframe that looks something like this, where each row represents a sample and contains repeats of the same strings
> df
V1 V2 V3 V4 V5
1 a a d d b
2 c a b d a
3 d b a a b
4 d d a b c
5 c a d c c
I want to be able to create a new dataframe, where ideally the headers would be the string values in the previous dataframe (a, b, c, d) and the contents of each row would be the number of occurrences of each respective value in the corresponding row of the original dataframe. Using the example from above, this would look like
> df2
a b c d
1 2 1 0 2
2 2 1 1 1
3 2 2 0 1
4 1 1 1 2
5 1 0 3 1
In my actual dataset, there are hundreds of variables, and thousands of samples, so it'd be ideal if I could automatically pull out the names from the original dataframe, and alphabetize them into the headers for the new dataframe.
You may try
library(qdapTools)
mtabulate(as.data.frame(t(df)))
Or
mtabulate(split(as.matrix(df), row(df)))
Or using base R
Un1 <- sort(unique(unlist(df)))
t(apply(df ,1, function(x) table(factor(x, levels=Un1))))
You can stack the columns and then use table (mydf here is the data frame from the question):
table(cbind(id = 1:nrow(mydf),
stack(lapply(mydf, as.character)))[c("id", "values")])
# values
# id a b c d
# 1 2 1 0 2
# 2 2 1 1 1
# 3 2 2 0 1
# 4 1 1 1 2
# 5 1 0 3 1
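In the same spirit of cross-tabulating a row id against the values, here is a compact base R sketch that uses row() on the matrix form of the data instead of stack(). It is an alternative formulation rather than the answer's own code, and m and tab are names introduced for the illustration.

```r
df <- data.frame(V1 = c("a", "c", "d", "d", "c"),
                 V2 = c("a", "a", "b", "d", "a"),
                 V3 = c("d", "b", "a", "a", "d"),
                 V4 = c("d", "d", "a", "b", "c"),
                 V5 = c("b", "a", "b", "c", "c"))

m <- as.matrix(df)                    # character matrix, one row per sample
tab <- table(id = row(m), values = m) # row(m) gives each cell's row number
df2 <- as.data.frame.matrix(tab)      # wide data frame with columns a, b, c, d
df2
#   a b c d
# 1 2 1 0 2
# 2 2 1 1 1
# 3 2 2 0 1
# 4 1 1 1 2
# 5 1 0 3 1
```

The column headers come out sorted alphabetically because table() orders the distinct values itself.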