Comparing two dataframes with a condition - r

I have two dataframes:
df1 <- data.frame( v1 = c(1,2,3,4),
v2 = c(2, 10, 5, 11),
v3=c(20, 25, 23, 2))
> df1
v1 v2 v3
1 1 2 20
2 2 10 35
3 3 5 23
4 4 11 2
df2 <- data.frame(v1 = 4, v2 = 10, v3 = 30)
> df2
v1 v2 v3
1 4 10 30
I want to add a new column that says "Fail" when a value in df1 is larger than the corresponding value in df2 and "Pass" when it is smaller, so that the intended result would be:
> df3
v1 v2 v3 check
1 1 2 20 Pass
2 2 10 35 Fail
3 3 5 23 Pass
4 4 11 2 Fail

You can repeat the single row of df2 so that both data frames have the same dimensions, then compare them directly:
ifelse(rowSums(df1 >= df2[rep(1,length.out = nrow(df1)), ]) == 0, 'Pass', 'Fail')
#[1] "Pass" "Fail" "Pass" "Fail"
Or using Map :
ifelse(Reduce(`|`, Map(`>=`, df1, df2)), 'Fail', 'Pass')
#Other similar alternatives :
#c('Pass', 'Fail')[Reduce(`|`, Map(`>=`, df1[-1], df2[-1])) + 1]
#c('Fail', 'Pass')[(rowSums(mapply(`>=`, df1, df2)) == 0) + 1]
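Since the question asks for a new column rather than a bare vector, here is a minimal sketch that attaches the result to a copy of df1. It assumes df2 has exactly one row and the same column names as df1, and it treats a tie as a failure, as the `>=` in the code above does. It uses sweep in place of the row-replication trick, so it is an alternative and not the method shown above. The name df3 comes from the question.

```r
df1 <- data.frame(v1 = c(1, 2, 3, 4),
                  v2 = c(2, 10, 5, 11),
                  v3 = c(20, 25, 23, 2))
df2 <- data.frame(v1 = 4, v2 = 10, v3 = 30)

# Compare every column of df1 against the matching threshold in df2.
# Assumption: df2 has one row and the same column names as df1.
thresholds <- unlist(df2[names(df1)])
exceeds <- sweep(as.matrix(df1), 2, thresholds, FUN = ">=")

# A row fails as soon as any of its values reaches the threshold.
df3 <- df1
df3$check <- ifelse(rowSums(exceeds) > 0, "Fail", "Pass")
df3
```

Selecting the thresholds with names(df1) means the comparison still lines up if the columns of df2 are in a different order.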

In the tidyverse, we can make use of c_across:
library(dplyr) # >= 1.0.0
df1 %>%
rowwise %>%
mutate(check = c('Pass', 'Fail')[1 + any(c_across(everything()) >= df2)])
# A tibble: 4 x 4
# Rowwise:
# v1 v2 v3 check
# <dbl> <dbl> <dbl> <chr>
#1 1 2 20 Pass
#2 2 10 25 Fail
#3 3 5 23 Pass
#4 4 11 2 Fail

Related

split columns in "the middle" in R

I have an example data frame as such:
df_1 <- as.data.frame(cbind(c(14, 27, 38), c(25, 33, 52), c(85, 12, 23)))
Now, I want to split all these columns down the middle so that I get something that would look like this:
df_2 <- as.data.frame(cbind(c(1, 2, 3), c(4, 7, 8), c(2,3,5), c(5, 3, 2), c(8, 1, 2), c(5, 2, 3)))
So my question then is: Is there a command/package that can do this automatically?
In my real data frame I am looking to split columns by name. The names come from an earlier regression, where I built them by inserting:
paste0(names(df)[i], "~", names(df)[j]) into my loop.
My thought, however, is that this will be quite easy once I find the right command for the data frames given above.
Thanks in advance!
You can use strsplit in base R:
as.data.frame(t(apply(df_1, 1, \(x) as.numeric(unlist(strsplit(as.character(x), ""))))))
V1 V2 V3 V4 V5 V6
1 1 4 2 5 8 5
2 2 7 3 3 1 2
3 3 8 5 2 2 3
Another possible solution:
library(tidyverse)
map(df_1, ~ str_split(.x, "", simplify = TRUE)) %>% as.data.frame %>%
`names<-`(str_c("V", 1:ncol(.))) %>% type.convert(as.is = TRUE)
#> V1 V2 V3 V4 V5 V6
#> 1 1 4 2 5 8 5
#> 2 2 7 3 3 1 2
#> 3 3 8 5 2 2 3
Thanks for the answers, they were a lot of help!
I ended up using the tidyr package with command:
test <- as.data.frame(separate(data = test, col = "V1", into = c("col_1", "col_2"), sep = "\\~"))
This worked great for me since I ran a regression earlier and had a good operator for separation: "~"
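For the real use case described in the question, where each value looks like "a~b", the same split can also be done in base R with strsplit. This is only a sketch: the data frame test, its column V1, and the variable names inside the strings are made up for illustration, and it assumes every string contains exactly one "~".

```r
# Hypothetical data: names built as paste0(a, "~", b) in an earlier loop
test <- data.frame(V1 = c("mpg~cyl", "mpg~disp", "cyl~disp"))

# fixed = TRUE treats "~" as a literal character, so no escaping is needed
parts <- do.call(rbind, strsplit(test$V1, "~", fixed = TRUE))

test$col_1 <- parts[, 1]
test$col_2 <- parts[, 2]
test
```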
A base R option would be to use read.fwf:
v1 <- do.call(paste0, df_1)
read.fwf(textConnection(v1), widths = rep(1, max(nchar(v1))))
Output:
V1 V2 V3 V4 V5 V6
1 1 4 2 5 8 5
2 2 7 3 3 1 2
3 3 8 5 2 2 3
Another option is to use the splitstackshape package:
library(dplyr)      # provides %>%
library(data.table) # provides setnames()
df_2 <- df_1 %>%
splitstackshape::cSplit(., names(.), sep = "", stripWhite = FALSE, type.convert = FALSE) %>%
setnames(paste0("V", 1:ncol(.)))
Output
df_2
V1 V2 V3 V4 V5 V6
1: 1 4 2 5 8 5
2: 2 7 3 3 1 2
3: 3 8 5 2 2 3

For each row, identify the proportion of columns that have the same value in R

I have a dataset of survey responses similar to this:
toy <- data.frame(v1 = c(1,2,3), v2 = c(1,6,3), v3 = c(1,2,4), v4 = c(1,7,3))
toy
v1 v2 v3 v4
1 1 1 1 1
2 2 6 2 7
3 3 3 4 3
I want to detect "straightlining" by finding the most common value for each row and calculating the proportion of columns with that value.
Two examples:
If the value of every column in a row is 5, then the new variable should return 1.
If the value of 60% of the columns in a row is 3 and 40% of the columns is 4, then the variable should return .6.
Desired output:
v1 v2 v3 v4 straightline_pct
1 1 1 1 1 1
2 2 6 2 7 .50
3 3 3 4 3 .75
One base approach:
toy <- data.frame(v1 = c(1,2,3), v2 = c(1,6,3), v3 = c(1,2,4), v4 = c(1,7,3))
toy$straightline_pct = apply(as.matrix(toy),
1L,
function (x) max(prop.table(table(x)))
)
toy
#> v1 v2 v3 v4 straightline_pct
#> 1 1 1 1 1 1.00
#> 2 2 6 2 7 0.50
#> 3 3 3 4 3 0.75
Slight variation with just table
toy$straightline_pct <- apply(toy, 1, function(x) max(table(x))/length(x) )
toy
v1 v2 v3 v4 straightline_pct
1 1 1 1 1 1.00
2 2 6 2 7 0.50
3 3 3 4 3 0.75
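To check the table-based formula against the two examples given in the question, here is a small sketch. The helper name straightline_pct and the two test rows are assumptions chosen to mirror the question's wording; they are not part of the original data.

```r
# Share of the most common value in a vector of responses
straightline_pct <- function(x) max(table(x)) / length(x)

all_fives <- c(5, 5, 5, 5, 5)   # every column is 5
mixed     <- c(3, 3, 3, 4, 4)   # 60% are 3, 40% are 4

straightline_pct(all_fives)  # 1
straightline_pct(mixed)      # 0.6
```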
A possible solution:
library(tidyverse)
toy <- data.frame(v1 = c(1,2,3), v2 = c(1,6,3), v3 = c(1,2,4), v4 = c(1,7,3))
toy %>%
rowwise %>%
mutate(perc = table(c_across(everything())) %>%
{(ncol(toy) - length(.) + 1) / ncol(toy)}) %>% ungroup
#> # A tibble: 3 × 5
#> v1 v2 v3 v4 perc
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1 1
#> 2 2 6 2 7 0.5
#> 3 3 3 4 3 0.75
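One caveat worth checking before reusing the snippet above: counting distinct values gives the share of the most common value only when at most one value repeats within a row. The sketch below uses a made-up row that is not in the question's data and compares that shortcut with the max(table(x)) calculation from the base R answers.

```r
# Hypothetical row with two pairs of equal answers
x <- c(2, 2, 3, 3)

# Shortcut based on the number of distinct values
shortcut <- (length(x) - length(table(x)) + 1) / length(x)

# Share of the most frequent value
by_count <- max(table(x)) / length(x)

shortcut  # 0.75
by_count  # 0.5
```

For the toy data in the question both calculations agree, because no row contains more than one repeated value.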
An alternative, based on ave function:
toy %>%
rowwise %>%
mutate(perc = c_across(1:4) %>%
{max(ave(., ., FUN=length)) / ncol(toy)}) %>% ungroup
I like @Paul Smith's and @Cole's answers better, but for completeness here's a more verbose approach:
library(tidyverse)
toy %>%
bind_cols(toy %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
count(row, value) %>%
group_by(row) %>%
mutate(straightline_pct = n / sum(n)) %>%
slice_max(straightline_pct, with_ties = FALSE) %>% # keep one row per group even when counts tie
ungroup() %>%
select(straightline_pct)
)
Here is a simple and verbose solution that is largely similar to other answers already:
library(tidyverse)
toy %>%
rowwise() %>%
mutate(
straightline_pct = max(table(c_across(everything()))) / ncol(.)
) %>%
ungroup()
# A tibble: 3 x 5
v1 v2 v3 v4 straightline_pct
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 2 6 2 7 0.5
3 3 3 4 3 0.75
If you can turn your data into a matrix first:
toy <- data.matrix(toy)
storage.mode(toy) <- "integer"
Then with the help of matrixStats:
library(matrixStats)
rowMaxs(rowTabulates(toy)) / ncol(toy)
[1] 1.00 0.50 0.75
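If matrixStats is not available, roughly the same idea can be sketched in base R with tabulate, which counts how often each positive integer occurs. This assumes, as the integer conversion above does, that all responses are positive whole numbers. The names m and pct are illustrative only.

```r
toy <- data.frame(v1 = c(1, 2, 3), v2 = c(1, 6, 3),
                  v3 = c(1, 2, 4), v4 = c(1, 7, 3))
m <- data.matrix(toy)

# For each row, count occurrences of each integer and keep the largest count
pct <- apply(m, 1, function(x) max(tabulate(x))) / ncol(m)
pct
```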

Binding rows based on common id

I have a very simple case where I want to combine several data frames into one, keeping only the ids that appear in one particular data frame.
Example:
id <- c(1, 2, 3)
x <- c(10, 12, 14)
data1 <- data.frame(id, x)
id <- c(2, 3)
x <- c(20, 22)
data2 <- data.frame(id, x)
id <- c(1, 3)
x <- c(30, 32)
data3 <- data.frame(id, x)
Which gives us,
$data1
id x
1 1 10
2 2 12
3 3 14
$data2
id x
1 2 20
2 3 22
$data3
id x
1 1 30
2 3 32
Now, I want to combine all three data frames based on the id's of the data3. The expected output should look like
> comb
id x
1 1 10
2 1 NA
3 1 30
4 3 14
5 3 22
6 3 32
I am trying the following, but not getting the expected output.
library(dplyr)
library(tidyr)
combined <- bind_rows(data1, data2, data3, .id = "id") %>% arrange(id)
Any idea how to get the expected output?
Does this work:
library(dplyr)
library(tidyr)
data1 %>% full_join(data2, by = 'id') %>% full_join(data3, by = 'id') %>% arrange(id) %>% right_join(data3, by = 'id') %>%
pivot_longer(cols = -id) %>% select(-name) %>% distinct()
# A tibble: 6 x 2
id value
<dbl> <dbl>
1 1 10
2 1 NA
3 1 30
4 3 14
5 3 22
6 3 32
Combine the 3 dataframes into one with bind_rows and use filter to select only the ids in the 3rd dataframe.
library(dplyr)
library(tidyr)
bind_rows(data1, data2, data3, .id = "new_id") %>%
filter(id %in% id[new_id == 3]) %>%
complete(new_id, id)
# new_id id x
# <chr> <dbl> <dbl>
#1 1 1 10
#2 1 3 14
#3 2 1 NA
#4 2 3 22
#5 3 1 30
#6 3 3 32
A pure base R solution can also do it:
lst <- list(data1, data2, data3)
reshape(
subset(
reshape(
do.call(rbind, Map(cbind, lst, grp = seq_along(lst))),
idvar = "id",
timevar = "grp",
direction = "wide"
),
id %in% lst[[3]]$id
),
idvar = "id",
varying = -1,
direction = "long"
)[c("id", "x")]
which gives
id x
1.1 1 10
3.1 3 14
1.2 1 NA
3.2 3 22
1.3 1 30
3.3 3 32
Using base R
do.call(rbind, unname(lapply(mget(ls(pattern = "^data\\d+$")), \(x) {
x1 <- subset(x, id %in% data3$id)
v1 <- setdiff(data3$id, x1$id)
if(length(v1) > 0) rbind(x1, cbind(id = v1, x = NA)) else x1
})))
Output:
id x
1 1 10
3 3 14
2 3 22
11 1 NA
12 1 30
21 3 32
bind_rows(data1, data2, data3, .id = 'grp')%>%
complete(id, grp)%>%
select(-grp) %>%
filter(id%in%data3$id)
# A tibble: 6 x 2
id x
<dbl> <dbl>
1 1 10
2 1 NA
3 1 30
4 3 14
5 3 22
6 3 32
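The answers above share one idea: keep only the ids found in data3 and fill in NA where a data frame lacks one of them. A minimal base R sketch of that idea uses match, which returns NA for an id that is missing. It assumes each id appears at most once per data frame. The names frames, keep_ids and pieces are illustrative.

```r
data1 <- data.frame(id = c(1, 2, 3), x = c(10, 12, 14))
data2 <- data.frame(id = c(2, 3), x = c(20, 22))
data3 <- data.frame(id = c(1, 3), x = c(30, 32))

frames <- list(data1, data2, data3)
keep_ids <- data3$id

# match() gives NA for an absent id, and indexing x with NA yields NA
pieces <- lapply(frames, function(d) {
  data.frame(id = keep_ids, x = d$x[match(keep_ids, d$id)])
})

comb <- do.call(rbind, pieces)
# order() keeps ties in their original order, so rows stay in
# data1, data2, data3 order within each id
comb <- comb[order(comb$id), ]
rownames(comb) <- NULL
comb
```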

Flatten data frame and shift rows to columns

I have a data frame like so:
df <- data.frame(
id = c(1, 1, 2, 2),
V1 = c(1:4),
V2 = c(5:8),
V3 = c(9:12))
Printed to the console it looks like this:
# id V1 V2 V3
# 1 1 1 5 9
# 2 1 2 6 10
# 3 2 3 7 11
# 4 2 4 8 12
Now, I would like to transform it to this shape:
# id V1 V2 V3 V4 V5 V6
# 1 1 1 5 9 2 6 10
# 2 2 3 7 11 4 8 12
How can I do this with base R or the tidyverse?
A possible tidyverse solution:
wide <- df %>%
group_by(id) %>%
mutate(obs = row_number()) %>%
gather(var, val, V1:V3) %>%
unite(comb, obs, var) %>%
spread(comb, val)
colnames(wide)[-1] <- paste("V", seq(1,ncol(wide) -1), sep = "")
# A tibble: 2 x 7
# Groups: id [2]
# id V1 V2 V3 V4 V5 V6
#1 1 1 5 9 2 6 10
#2 2 3 7 11 4 8 12
You could do it using by, for example:
df2 <- do.call(rbind,
by(df, df$id, function(x) c(x[1, "id"], as.vector(t(x[names(x) != "id"]))))
)
colnames(df2) <- c("id", paste0("V", seq(ncol(df2)-1)))
id V1 V2 V3 V4 V5 V6
1 1 1 5 9 2 6 10
2 2 3 7 11 4 8 12
Base R:
lists <- Map(function(x) data.frame(c(x[1,], x[2,-1])), split(df, df$id))
df2 <- do.call(rbind, lists)
To change the column names:
colnames(df2) <- c("id", paste0("V", seq_along(df2[-1])))
And the result:
# > df2
# id V1 V2 V3 V4 V5 V6
# 1 1 1 5 9 2 6 10
# 2 2 3 7 11 4 8 12
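The Base R answer above hard-codes two rows per id. As a sketch of how the same flattening could handle any fixed number of rows per id, the version below transposes each group and reads it out as one vector. It assumes every id has the same number of rows, otherwise rbind would recycle values. The names value_cols, flat and wide are illustrative.

```r
df <- data.frame(id = c(1, 1, 2, 2), V1 = 1:4, V2 = 5:8, V3 = 9:12)

value_cols <- setdiff(names(df), "id")

# One vector per id: each row is read left to right, one row after another
flat <- lapply(split(df[value_cols], df$id),
               function(g) as.vector(t(as.matrix(g))))

# split() orders groups by sorted id, so sort the ids the same way
wide <- data.frame(id = sort(unique(df$id)), do.call(rbind, flat))
names(wide)[-1] <- paste0("V", seq_len(ncol(wide) - 1))
rownames(wide) <- NULL
wide
```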

Comparing two unequal data frame and selecting common values in the order of second data frame

I have two data frames as follows:
df1<-data.frame(st=c(1,2,3,4),v1=c(12,14,15,75),v2=c(43,32,12,18))
df1
st v1 v2
1 1 12 43
2 2 14 32
3 3 15 12
4 4 75 18
df2<-data.frame(st=c(1,2,3,4),v1=c(12,24,35,18),v2=c(48,32,121,82),v3=c(53,11,12,75))
df2
st v1 v2 v3
1 1 12 48 53
2 2 24 32 11
3 3 35 121 12
4 4 18 82 75
What I want is to match both data frames on the "st" column. For example, for st = 1 in df1 the corresponding values of v1 and v2 are 12 and 43. If any of the variables for st = 1 in df2 contain these values, then I want to select st and those values from df2.
So for the above example the output will be
St values
1 12(coming from v1 in df2)
2 32(coming from v2 in df2)
3 12(coming from v3 in df2)
4 18 75(coming from v1 & v3 in df2)
The important thing to note is that in the output data frame the order of the selected values should follow df2. For example, for st = 4 the values in df1 are 75 and 18, but the output is 18 and then 75, which is their order in df2. Also, df2 will always have more variables than df1.
If I understand you correctly...
Step 0. prepare data
You mentioned that you only want to select rows that fit your conditions, but the sample dataset has at least one match in each row. I tweaked it so that there's no match for st = 3, to demonstrate that the row will not be returned in the result.
df1<-data.frame(st=c(1,2,3,4),v1=c(12,14,15,75),v2=c(43,32,12,18))
df2<-data.frame(st=c(1,2,3,4),v1=c(12,24,35,18),v2=c(48,32,121,82),v3=c(53,11,13,75))
Step 1. combine the datasets
library(dplyr) # %>%, mutate, group_by, ...
library(tidyr) # gather
combined.df <- rbind(df1 %>% gather(v, n, -st) %>% mutate(df = "df1"),
df2 %>% gather(v, n, -st) %>% mutate(df = "df2"))
> head(combined.df)
st v n df
1 1 v1 12 df1
2 2 v1 14 df1
3 3 v1 15 df1
4 4 v1 75 df1
5 1 v2 43 df1
6 2 v2 32 df1
Step 2. compare & keep only matched ones from df2
res <- combined.df %>%
group_by(st) %>%
mutate(n = ifelse(df=="df1", n, ifelse(n %in% n[df=="df1"], n, NA))) %>%
ungroup() %>%
filter(df=="df2", !is.na(n)) %>%
arrange(st, v)
# if you just want the values, you can stop here.
> res
# A tibble: 4 × 4
st v n df
<dbl> <chr> <dbl> <chr>
1 1 v1 12 df2
2 2 v2 32 df2
3 4 v1 18 df2
4 4 v3 75 df2
# this part formats the result to follow that of the desired output
res <- res %>%
group_by(st) %>%
summarise(values = paste(as.character(n), collapse = " ")) %>%
ungroup()
> res
# A tibble: 3 × 2
st values
<dbl> <chr>
1 1 12
2 2 32
3 4 18 75
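For comparison, the same matching can be sketched in base R without reshaping. The version below runs on the original data from the question, so st = 3 does have a match here. It assumes each st appears exactly once in both data frames. The names matched and res are illustrative.

```r
df1 <- data.frame(st = c(1, 2, 3, 4), v1 = c(12, 14, 15, 75), v2 = c(43, 32, 12, 18))
df2 <- data.frame(st = c(1, 2, 3, 4), v1 = c(12, 24, 35, 18),
                  v2 = c(48, 32, 121, 82), v3 = c(53, 11, 12, 75))

# For each st, keep the df2 values that also occur in df1, in df2's order
matched <- lapply(df2$st, function(s) {
  a <- unlist(df1[df1$st == s, -1])
  b <- unlist(df2[df2$st == s, -1])
  unname(b[b %in% a])
})

res <- data.frame(st = df2$st,
                  values = vapply(matched, paste, character(1), collapse = " "))

# Drop st values without any match
res <- res[res$values != "", ]
res
```

Filtering b rather than a is what keeps the values in the column order of df2, which is the ordering requirement stated in the question.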
If you use the merge function, you can create a single data frame that holds the values from both, matched on st:
new<-merge(df1,df2,by="st")
new
st v1.x v2.x v1.y v2.y v3
1 1 12 43 12 48 53
2 2 14 32 24 32 11
3 3 15 12 35 121 12
4 4 75 18 18 82 75
And if you want, you then can order it in the way you want. For example:
new2<-new[,1:2]
new2$from<-"from v1"
names(new2)<-c("st","value","from")
for(i in 3:ncol(new)){
new3<-new[,c(1,i)]
new3$from<-paste0("from v",i)
names(new3)<-c("st","value","from")
new2<-rbind(new2,new3)
}
This is not the most efficient way, but it will work if you do not have much data.
