Paste value from for loop into data frame R - r

I have two dataframes in R, recurrent and L1HS. I am trying to find a way to do this:
If a sequence in recurrent matches sequence in L1HS, paste a value from a column in recurrent into new column in L1HS.
The recurrent dataframe looks like this:
> head(recurrent)
chr start end X Y level unique
1: chr4 56707846 56708347 0 38 03 chr4_56707846_56708347
2: chr1 20252181 20252682 0 37 03 chr1_20252181_20252682
3: chr2 224560903 224561404 0 37 03 chr2_224560903_224561404
4: chr5 131849595 131850096 0 36 03 chr5_131849595_131850096
5: chr7 46361610 46362111 0 36 03 chr7_46361610_46362111
6: chr1 20251169 20251670 0 36 03 chr1_20251169_20251670
The L1HS dataset contains many columns containing genetic sequence basepairs and a column "Sequence" that should hopefully have some matches with "unique" in the recurrent data frame, like so:
> head(L1HS$Sequence)
"chr1_35031657_35037706"
"chr1_67544575_67550598"
"chr1_81404889_81410942"
"chr1_84518073_84524089"
"chr1_87144764_87150794"
I know how to search for matches using
test <- recurrent$unique %in% L1HS$Sequence
to get the Booleans:
> head(test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
But I have a couple of problems from here. If the sequence is found, I want to copy the "level" value from the recurrent dataset to the L1HS dataset in a new column. For example, if the sequence "chr4_56707846_56708347" from the recurrent data was found in the full-length data, I'd like the full-length data frame to look like:
Sequence level other_columns
chr4_56707846_56708347 03 gggtttcatgaccc....
I was thinking of trying something like:
for (i in L1HS){
if (recurrent$unique %in% L1HS$Sequence{
L1HS$level <- paste(recurrent$level[i])}
}
but of course this isn't working and I can't figure it out.
I am wondering what the best approach is here! I'm wondering if merge/intersect/apply might be easier/better, or just what best practice might look like for a somewhat simple question like this. I've found some similar examples for Python/pandas, but am stuck here.
Thanks in advance!

You can do a simple left_join to add level to L1HS with dplyr.
library(dplyr)
L1HS %>%
left_join(., recurrent %>% select(unique, level), by = c("Sequence" = "unique"))
Or with merge:
merge(x=L1HS,y=recurrent[, c("unique", "level")], by.x = "Sequence", by.y = "unique",all.x=TRUE)
Output
Sequence level
1 chr1_35031657_35037706 4
2 chr1_67544575_67550598 2
3 chr1_81404889_81410942 NA
4 chr1_84518073_84524089 3
5 chr1_87144764_87150794 NA
*Note: This will still retain all the columns in L1HS. I just didn't create any additional columns in the example data below.
Data
recurrent <- structure(list(chr = c("chr4", "chr1", "chr2", "chr5", "chr7",
"chr1"), start = c(56707846L, 20252181L, 224560903L, 131849595L,
46361610L, 20251169L), end = c(56708347L, 20252682L, 224561404L,
131850096L, 46362111L, 20251670L), X = c(0L, 0L, 0L, 0L, 0L,
0L), Y = c(38L, 37L, 37L, 36L, 36L, 36L), level = c(3L, 2L, 3L,
3L, 3L, 4L), unique = c("chr4_56707846_56708347", "chr1_67544575_67550598",
"chr2_224560903_224561404", "chr5_131849595_131850096", "chr1_84518073_84524089",
"chr1_35031657_35037706")), class = "data.frame", row.names = c(NA,
-6L))
L1HS <- structure(list(Sequence = c("chr1_35031657_35037706", "chr1_67544575_67550598",
"chr1_81404889_81410942", "chr1_84518073_84524089", "chr1_87144764_87150794"
)), class = "data.frame", row.names = c(NA, -5L))

Related

How can I combine rows of data when their character values are equal? (R)

I have a dataset that was recorded by observation(each observation has its own row of data). I am looking to combine/condense these rows by the plant they were found on - currently a character variable. All other columns are numerical vales.
EX:
This is the raw data
|Sci_Name|Honeybee_count|Other_bee_Obsevrved|Stem_count|
|---|---|---|---|
|Zizia aurea|1|5|10|
|Asclepias viridiflora|15|1|3|
|Viola unknown|0|0|4|
|Zizia aurea|0|2|6|
|Zizia aurea|3|6|3|
|Asclepias viridiflora|8|2|17|
and I want:
Sci_Name
Honeybee_count
Other_bee_Obsevrved
Stem_count
Zizia aurea
4
13
19
Asclepias viridiflora
23
3
20
Viola unknown
0
0
4
I am currently pulling this data from a CSV already in table form. I have been attempting to create a new table/data frame with one entry of each plant species, and blanks/0s for each other variable, which I can then use to c-binding the two together. This, however, has been clunky at best and I am having trouble figuring out how to have each row check itself. I am open to any approach, let me know what you think!
Thanks :D
We can use the formula method in aggregate from base R. On the rhs of the ~, specify the grouping variable and on the lhs, use . for denoting the rest of the variables. Specify the FUN as sum and it will do the column wise sum by group
aggregate(. ~ Sci_Name, df1, sum)
-output
Sci_Name Honeybee_count Other_bee_Obsevrved Stem_count
1 Asclepias viridiflora 23 3 20
2 Viola unknown 0 0 4
3 Zizia aurea 4 13 19
data
df1 <- structure(list(Sci_Name = c("Zizia aurea", "Asclepias viridiflora",
"Viola unknown", "Zizia aurea", "Zizia aurea", "Asclepias viridiflora"
), Honeybee_count = c(1L, 15L, 0L, 0L, 3L, 8L), Other_bee_Obsevrved = c(5L,
1L, 0L, 2L, 6L, 2L), Stem_count = c(10L, 3L, 4L, 6L, 3L, 17L)),
class = "data.frame", row.names = c(NA,
-6L))

Merging two columns into one based on value

I have a dataset with two columns containing the following: an indicator number and a hashcode
The only problem is that the columns have the same name, but the value can switch columns.
Now I want to merge the columns and keep the number (I don't care about the hashcode)
I saw this question: Merge two columns into one in r
and I tried the coalesce() function, but that is only for having NA values. Which I don't have. I looked at the unite function, but according to the cheat sheet documentation documentation here that doesn't what I'm looking for
My next try was the filter_at and other filter functions from the dplyr package Documentation here
But that only leaves 150 data points while at the start I have 61k data points.
Code of filter_at I tried:
data <- filter_at(data,vars("hk","hk_1"),all_vars(.>0))
I assumed that a #-string shall not be greater than 0, which seems to be true, but it removes more than intented.
I would like to keep hk or hk_1 value which is a number. The other one (the hash) can be removed. Then I want a new column which only contains those numbers.
Sample data
My data looks like this:
HK|HK1
190|#SP0839
190|#SP0340
178|#SP2949
#SP8390|177
#SP2240|212
What I would like to see:
HK
190
190
178
177
212
I hope this provides an insight into the data. There are more columns like description, etc which makes that 190 at the start are not doubles.
We can replace all the values that start with "#" to NA and then use coalesce to select non-NA value between HK and HK1.
library(dplyr)
df %>%
mutate_all(~as.character(replace(., grepl("^#", .), NA))) %>%
mutate(HK = coalesce(HK, HK1)) %>%
select(HK)
# HK
#1 190
#2 190
#3 178
#4 177
#5 212
data
df <- structure(list(HK = structure(c(4L, 4L, 3L, 2L, 1L), .Label = c("#SP2240",
"#SP8390", "178", "190"), class = "factor"), HK1 = structure(c(2L,
1L, 3L, 4L, 5L), .Label = c("#SP0340", "#SP0839", "#SP2949",
"177", "212"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))

how can I group based on similarity in strings

I have a data like this
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to get the largest name (in letter) and then see how many smaller and similar names are and assign them to a group
then go for another next large name and assign them to another group
until no group left
at first I calculate the length of each so I will have the length of them
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I should see which other strings have the letters similar to this long string . we have these possibilities
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
all possible combinations but the order of letter should be the same (from left to right) for example it should be Afghand cannot be fAhg
so we have only two other strings that are similar to this one
Afghanestan
Afghanestankabol
it is because they should be exactly similar and not even a letter different (more than the largest string) to be assigned to the same group
The desire output for this is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
why indiaAfghanestan is a seperate group? because it does not completely belong to another name (it has partially name from one or another). it should be part of a bigger name
I tried to use this one Find similar strings and reconcile them within one dataframe which did not help me at all
I found something else which maybe helps
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
tolower %>%
trimws %>%
stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
as.dist %>%
`attr<-`("Labels", df$label) %>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.

How do summarize this data table with dplyr, then run a chisq.test (or similar) on the results and loop it all into one neat function?

This question was embedded in another question I asked here, but as it goes beyond the scope of what I wanted to know in the initial inquiry, I thought it might deserve a separate thread.
I've been trying to come up with a solution for this problem based on the answers I have received here and here using dplyr and the functions written by Khashaa and Jaap.
Using the solutions provided to me (especially from Jaap), I have been able to summarize the raw data I received into a matrix-looking data table
dput(SO_Example_v1)
structure(list(Type = structure(c(3L, 1L, 2L), .Label = c("Community",
"Contaminant", "Healthcare"), class = "factor"), hosp1_WoundAssocType = c(464L,
285L, 24L), hosp1_BloodAssocType = c(73L, 40L, 26L), hosp1_UrineAssocType = c(75L,
37L, 18L), hosp1_RespAssocType = c(137L, 77L, 2L), hosp1_CathAssocType = c(80L,
34L, 24L), hosp2_WoundAssocType = c(171L, 115L, 17L), hosp2_BloodAssocType = c(127L,
62L, 12L), hosp2_UrineAssocType = c(50L, 29L, 6L), hosp2_RespAssocType = c(135L,
142L, 6L), hosp2_CathAssocType = c(95L, 24L, 12L)), .Names = c("Type",
"hosp1_WoundAssocType", "hosp1_BloodAssocType", "hosp1_UrineAssocType",
"hosp1_RespAssocType", "hosp1_CathAssocType", "hosp2_WoundAssocType",
"hosp2_BloodAssocType", "hosp2_UrineAssocType", "hosp2_RespAssocType",
"hosp2_CathAssocType"), class = "data.frame", row.names = c(NA,
-3L))
Which looks as follows
require(dplyr)
df <- tbl_df(SO_Example_v1)
head(df)
Type hosp1_WoundAssocType hosp1_BloodAssocType hosp1_UrineAssocType
1 Healthcare 464 73 75
2 Community 285 40 37
3 Contaminant 24 26 18
Variables not shown: hosp1_RespAssocType (int), hosp1_CathAssocType (int), hosp2_WoundAssocType
(int), hosp2_BloodAssocType (int), hosp2_UrineAssocType (int), hosp2_RespAssocType (int),
hosp2_CathAssocType (int)
The column Type is the type of bacteria, the following columns represent where they were cultured. The digits represent the number of times the respective type of bacteria were detected.
I know what my final table should look like, but until now I have been doing it step by step for each comparison and variable and there must undoubtedly be a way to do this by piping multiple functions in dplyr - but alas, I have not found the answer on SO to this.
Example of what final table should look like
Wound
Type n Hospital 1 (%) n Hospital 2 (%) p-val
Healthcare associated bacteria 464 (60.0) 171 (56.4) 0.28
Community associated bacteria 285 (36.9) 115 (38.0) 0.74
Contaminants 24 (3.1) 17 (5.6) 0.05
Where the first grouping variable "Wound" is then subsequently replaced by "Urine", "Respiratory", ... and then there's a final column termed "All/Total", which is the total number of times each variable in the rows of "Type" was found and summarized across Hospital 1 and 2 and then compared.
What I have done until now is the following and very tedious, as it's calculated "by hand" and then I poulate the table with all of the results manually.
### Wound cultures & healthcare associated (extracted manually)
# hosp1 464 (yes), 309 (no), 773 wound isolates in total; (% = 464 / 309 * 100)
# hosp2 171 (yes), 132 (no), 303 would isolates in total; (% = 171 / 303 * 100)
### Then the chisq.test of my contingency table
chisq.test(cbind(c(464,309),c(171,132)),correct=FALSE)
I appreciate that if I run a piped dplyr on the raw data.frame I won't be able to get the exact formatting of my desired table, but there must be a way to at least automate all the steps here and put the results together in a final table that I can export as a .csv file and then just do some final column editing etc.?
Any help is greatly appreciated.
It's ugly, but it works (Sam in the comments is right that this whole issue should probably be addressed by adjusting your data to a clean format before analysing, but anyway):
Map(
function(x,y) {
out <- cbind(x,y)
final <- rbind(out[1,],colSums(out[2:3,]))
chisq.test(final,correct=FALSE)
},
SO_Example_v1[grepl("^hosp1",names(SO_Example_v1))],
SO_Example_v1[grepl("^hosp2",names(SO_Example_v1))]
)
#$hosp1_WoundAssocType
#
# Pearson's Chi-squared test
#
#data: final
#X-squared = 1.16, df = 1, p-value = 0.2815
# etc etc...
Matches your intended result:
chisq.test(cbind(c(464,309),c(171,132)),correct=FALSE)
#
# Pearson's Chi-squared test
#
#data: cbind(c(464, 309), c(171, 132))
#X-squared = 1.16, df = 1, p-value = 0.2815

Combing two data frames if values in one column fall between values in another

I imagine that there's some way to do this with sqldf, though I'm not familiar with the syntax of that package enough to get this to work. Here's the issue:
I have two data frames, each of which describe genomic regions and contain some other data. I have to combine the two if the region described in the one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")

Resources