Adding new information to a table upon matching rows - r

I have very basic knowledge of R. I have two tabs (A and B) with rows I want to compare - some values match and some don't. I want R to find the matching elements and add the text value "E" to a pre-existing row in tab A if this is the case.
Example:
Tab A
ID Existing?
1 A
2 B
3 C
4 D
5 E
Tab B
ID
1 D
2 B
3 Y
4 A
5 W
Upon match:
Tab A
ID Existing?
1 A E
2 B E
3 C
4 D E
5 E
I have found information online on how to match tables but none on how to write new information when the match takes place.
Please explain like I'm 5... I have no programming background.
Thank you in advance!

Use match to get the elements in df1$ID that are also in df2$ID, and ifelse to recode the values that are both in df1 and in df2 with "E", and NA otherwise.
df1 <- data.frame(ID = LETTERS[1:5])
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"))
df1$Existing <- ifelse(match(df1$ID, df2$ID), "E", NA)
ID Existing
1 A E
2 B E
3 C <NA>
4 D E
5 E <NA>

Another solution - using dplyr - would be to join the two dataframes, where you have added the column Existing to the one being joined:
library(dplyr, warn.conflicts = FALSE)
df1 <- tibble(ID = LETTERS[1:5])
df2 <- tibble(ID = c("D", "B", "Y", "A", "W"))
df1 %>%
left_join(df2 %>% mutate(Existing = "E"))
#> Joining, by = "ID"
#> # A tibble: 5 x 2
#> ID Existing
#> <chr> <chr>
#> 1 A E
#> 2 B E
#> 3 C <NA>
#> 4 D E
#> 5 E <NA>
This will set all matching IDs to E and all non-matching to NA.

# data
tab1 <- structure(list(ID = c("A", "B", "C", "D", "E"), Existing = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA,
-5L))
tab2 <- structure(list(ID = c("D", "B", "Y", "A", "W")), class = "data.frame", row.names = c(NA,
-5L))
There are many ways to skin this cat. In base-R, you could try, e.g.,
tab1$Existing[tab1$ID %in% tab2$ID] <- 'E'
In practise, for anything more complicated than tables with 6 rows, you could try dplyr:
library(dplyr)
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA))
Another useful tool -- with a slightly differing syntax -- is data.table.
library(data.table)
setDT(tab1) -> tab1
setDT(tab2) -> tab2
tab1[,Existing := ifelse(tab1$ID %in% tab2$ID, 'E',NA)]
Note that, here mutate and := play roughly the same role. Probably, if you work more with R, you will develop an affinity with one of the "dialects" above.
EDIT: To drop the rows NA values values (in dplyr), you could either do:
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA)) %>%
filter(!is.na(Existing))
Or piggy-backing on #jpiversen's solution:
df1 %>%
inner_join(df2 %>% mutate(Existing = "E"))

Related

Produce all combinations of one element with all other elements within a single column in R

Suppose I have a data frame with a single column that contains letters a, b, c, d, e.
a
b
c
d
e
In R, is it possible to extract a single letter, such as 'a', and produce all possible paired combinations between 'a' and the other letters (with no duplications)? Could the combn command be used in this case?
a b
a c
a d
a e
We can use data.frame
data.frame(col1 = 'a', col2 = setdiff(df1$V1, "a"))
-ouptput
col1 col2
1 a b
2 a c
3 a d
4 a e
data
df1 <- structure(list(V1 = c("a", "b", "c", "d", "e")),
class = "data.frame", row.names = c(NA,
-5L))
Update:
With .before=1 argument the code is shorter :-)
df %>%
mutate(col_a = first(col1), .before=1) %>%
slice(-1)
With dplyr you can:
library(dplyr)
df %>%
mutate(col2 = first(col1)) %>%
slice(-1) %>%
select(col2, col1)
Output:
col2 col1
<chr> <chr>
1 a b
2 a c
3 a d
4 a e
You could use
expand.grid(x=df[1,], y=df[2:5,])
which returns
x y
1 a b
2 a c
3 a d
4 a e

R merge rows with same row names and column name [duplicate]

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Closed 2 years ago.
I have the data like
How can I reshape the data by merge the rows with same rowname and columname like this:
Trust you have allele information missing.
If added as following to the data:
data['allele']=c('a1','a2','a1','a2')
then following will solve the problem easily:
Basically wide to long, followed by joining columns of SNP and allele and then wide again.
library(tidyr)
long=data %>% gather(snp, value, -c(Pedigree,allele))
long_joined=unite(long, snp, c(snp, allele), remove=TRUE)
spread(long_joined, key = snp, value = value)
Maybe you can try aggregate with unlist:
> aggregate(.~P,df,unlist)
P S1.1 S1.2 S2.1 S2.2
1 a C C G G
2 b C C T T
Data
> dput(df)
structure(list(P = c("a", "a", "b", "b"), S1 = c("C", "C", "C",
"C"), S2 = c("G", "G", "T", "T")), class = "data.frame", row.names = c(NA,
-4L))
Solution using dplyr which is part of the tidyverse collection of R packages.
library(dplyr)
Data:
bar <- "Pedigree SNP1 SNP2
'Individual 1' C G
'Individual 1' C G
'Individual 2' C T
'Individual 2' C T"
foo <- read.table(text=bar, header = TRUE)
Code:
foo %>%
group_by(Pedigree) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = SNP1:SNP2, names_prefix = ".a")
Output:
#> # A tibble: 2 x 5
#> # Groups: Pedigree [2]
#> Pedigree SNP1_.a1 SNP1_.a2 SNP2_.a1 SNP2_.a2
#> <fct> <fct> <fct> <fct> <fct>
#> 1 Individual 1 C C G G
#> 2 Individual 2 C C T T
```
Created on 2020-07-26 by the reprex package (v0.3.0)

Organize subgroup strings (text)

I am trying to convert something like this df format:
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df
first words
1 a about
2 a among
3 b blue
4 b but
5 b both
6 c cat
into the following format:
df1
first words
1 a about, among
2 b blue, but, both
3 c cat
>
I have tried
aggregate(words ~ first, data = df, FUN = list)
first words
1 a 1, 2
2 b 3, 5, 4
3 c 6
and tidyverse:
df %>%
group_by(first) %>%
group_rows()
Any suggestions would be appreciated!
A data.table solution:
library(data.table)
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df <- setDT(df)[, lapply(.SD, toString), by = first]
df
# first words
# 1: a about, among
# 2: b blue, but, both
# 3: c cat
# convert back to a data.frame if you want
setDF(df)
Using tidyverse, after the group_by use summarise to either paste
library(dplyr)
df %>%
group_by(first) %>%
summarise(words = toString(words))
# A tibble: 3 x 2
# first words
# <fct> <chr>
#1 a about, among
#2 b blue, but, both
#3 c cat
or keep it as a list column
df %>%
group_by(first) %>%
summarise(words = list(words))

How to add new column to R dataframe based on values in multiple columns

I have created the following dataframe
df<-data.frame("A"<-(1:5), "B"<-c("A","B", "C", "B",'C' ), "C"<-c("A", "A",
"B", 'B', "B"))
names(df)<-c("A", "B", "C")
I am triyng to obtain the duplicated values between columns A and C following output and add the corresponding values in column B . The expected dataframe should be
df2<- "B" "Dupvalues"
1 A
4 B
I am unable to do this. I request some help here
df<-data.frame(A = (1:5),
B = c("A","B", "C", "B",'C' ),
C = c("A", "A","B", 'B', "B"), stringsAsFactors = F)
library(dplyr)
df %>%
filter(B == C) %>% # keep rows when B equals C
group_by(A) %>% # for each A
transmute(DupValues = B) %>% # keep the duplicate value
ungroup() # forget the grouping
# # A tibble: 2 x 2
# A DupValues
# <int> <chr>
# 1 1 A
# 2 4 B
Note that this works if your variables are not factors, but character varaibles.

subset a dataframe based on sum of a column

I have a df that looks like this:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
5 e 0.11258865
6 f 0.07228970
7 g 0.05673759
8 h 0.05319149
9 i 0.03989362
I would like to subset it using the sum of the column value, i.e, I want to extract those rows which sum of values from column value is higher than 0.6, but starting to sum values from the first row. My desired output will be:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
I have tried df2[, colSums[,5]>=0.6] but obviously colSums is expecting an array
Thanks in advance
Here's an approach:
df2[seq(which(cumsum(df2$value) >= 0.6)[1]), ]
The result:
name value
1 a 0.2001942
2 b 0.1799645
3 c 0.1425701
4 d 0.1425701
I'm not sure I understand exactly what you are trying to do, but I think cumsum should be able to help.
First to make this reproducible, let's use dput so others can help:
df <- structure(list(name = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), value = c(0.20019421,
0.17996454, 0.1425701, 0.1425701, 0.11258865, 0.0722897, 0.05673759,
0.05319149, 0.03989362)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-9L))
Then look at what cumsum(df$value) provides:
cumsum(df$value)
# [1] 0.2001942 0.3801587 0.5227289 0.6652990 0.7778876 0.8501773 0.9069149 0.9601064 1.0000000
Finally, subset accordingly:
subset(df, cumsum(df$value) <= 0.6)
# name value
# 1 a 0.2001942
# 2 b 0.1799645
# 3 c 0.1425701
subset(df, cumsum(df$value) >= 0.6)
# name value
# 4 d 0.14257010
# 5 e 0.11258865
# 6 f 0.07228970
# 7 g 0.05673759
# 8 h 0.05319149
# 9 i 0.03989362

Resources