Extracting values from R table within grouped values - r

I have the following table, grouped by the first, second and Name columns.
myData <- structure(list(first = c(120L, 120L, 126L, 126L, 126L, 132L, 132L), second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
Name = structure(c(5L, 5L, 3L, 3L, 4L, 1L, 2L), .Label = c("Benzene",
"Ethene._trichloro-", "Heptene", "Methylamine", "Pentanone"
), class = "factor"), Area = c(699468L, 153744L, 32913L,
4948619L, 83528L, 536339L, 105598L), Sample = structure(c(3L,
2L, 3L, 3L, 3L, 1L, 1L), .Label = c("PO1:1", "PO2:1", "PO4:1"
), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -7L))
Within each group I want to extract the Area that corresponds to a specific sample. Several groups don't have areas for every sample, so if a sample isn't detected it should return NA. Ideally, the final output should have one column per sample.
I have tried the ifelse function to create one column to each sample:
PO1<-ifelse(myData$Sample=="PO1:1",myData$Area, "NA")
However, this doesn't take the grouping into account. I want to do the same thing, but within each group (a group has equal values in the first, second and Name columns): if Sample == "PO1:1", return Area, else NA.
For the first group:
structure(list(first = c(120L, 120L), second = c(1.33, 1.33),
Name = structure(c(1L, 1L), .Label = "Pentanone", class = "factor"),
Area = c(699468L, 153744L), Sample = structure(c(2L, 1L), .Label = c("PO2:1",
"PO4:1"), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -2L))
The output should be:
structure(list(PO1.1 = NA, PO2.1 = 153744L, PO3.1 = NA, PO4.1 = 699468L), .Names =c("PO1.1", "PO2.1", "PO3.1", "PO4.1"), class = "data.frame", row.names = c(NA, -1L))
Any suggestion?

As in the example in the question, I am assuming Sample is a factor. If this is not the case, consider making it one.
First, let's clean up the levels of Sample so that they are syntactically valid names, otherwise they might cause errors later:
levels(myData$Sample) <- make.names(levels(myData$Sample))
## DEFINE THE CUTS##
# Adjust these as necessary
#--------------------------
max.second <- 3 # max & min range of myData$second
min.second <- 0 #
sprd <- 0.15 # spread for each group (intervals are 2*sprd wide)
#--------------------------
# we will cut the myData$second according to intervals, cut(myData$second, intervals)
intervals <- seq(min.second, max.second, sprd*2)
# Next, lets create a group column to split our data frame by
myData$group <- paste(myData$first, cut(myData$second, intervals), myData$Name, sep='-')
groups <- split(myData, myData$group)
samples <- levels(myData$Sample) ## I'm assuming not all samples are present in the example. Manually adjusting with: samples <- sort(c(samples, "PO3.1"))
# Apply over each group, then apply over each sample
myOutput <-
  t(sapply(groups, function(g) {
    #-------------------------------
    # NOTE: If it's possible that within a group there is more than one Area per
    #       Sample, then we have to somehow allow for this. Hence the paste0(...)
    res <- sapply(samples, function(s) paste0(g$Area[g$Sample==s], collapse=" - ")) # allowing for multiple values
    unlist(ifelse(res=="", NA, res))
    ## If there is (or should be) only one Area per Sample, then remove the two lines above and uncomment the two below:
    # res <- sapply(samples, function(s) g$Area[g$Sample==s]) # <~~ works when there is only one value per sample
    # unlist(ifelse(res==0, NA, res))
    #-------------------------------
  }))
# Cleanup names
rownames(myOutput) <- paste("Group", 1:nrow(myOutput), sep="-") ## or whichever proper group name
# remove dummy column
myData$group <- NULL
Results
myOutput
PO1.1 PO2.1 PO3.1 PO4.1
Group-1 NA "153744" NA "699468"
Group-2 NA NA NA "32913 - 4948619"
Group-3 NA NA NA "83528"
Group-4 "536339" NA NA NA
Group-5 "105598" NA NA NA
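Note that t(sapply(...)) returns a character matrix; if a data.frame is more convenient downstream, it can be converted afterwards (an optional step, not part of the original answer):
myOutput.df <- as.data.frame(myOutput, stringsAsFactors = FALSE)  # keep the strings as character, not factor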

You cannot really expect R to intuit that there is a fourth factor level between PO2 and PO4, now can you?
> reshape(myData, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
first second Name Area.PO4:1 Area.PO2:1 Area.PO1:1
1 120 1.3 Pentanone 699468 153744 NA
3 126 0.4 Heptene 32913 NA NA
4 126 0.4 Heptene 4948619 NA NA
5 126 0.3 Methylamine 83528 NA NA
6 132 0.5 Benzene NA NA 536339
7 132 0.5 Ethene._trichloro- NA NA 105598
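reshape() only creates an Area column for sample values that actually occur in the data, so the absent PO3:1 never shows up. If that column is needed anyway, one option (a sketch, not from the original answer) is to add it by hand after reshaping:
wide <- reshape(myData, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
wide[["Area.PO3:1"]] <- NA   # sample never observed, so reshape() cannot create this column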

Related

Merging / combining columns is not working at all

I can't for the life of me figure out a way of getting this working without changing the class of the columns or getting a random level that isn't even in the original columns!
I have data that looks like this:
data <- structure(list(WHY = structure(1:4, .Label = c("WHY1", "WHY2",
"WHY3", "WHY4"), class = "factor"), HELP1 = structure(c(3L, NA,
1L, 2L), .Label = c("1", "2", "D/A"), class = "factor"), HELP2 = c(NA,
2L, NA, NA)), class = "data.frame", row.names = c(NA, -4L))
What I want to do:
If HELP2 is NOT NA and HELP1 is D/A, then merge the columns WITHOUT changing class.
Here is what I tried:
data$HELP3 <-
ifelse(
!is.na(data$HELP2) &
data$HELP1 == "D/A",
data$HELP1, data$HELP2)
Result:
data
WHY HELP1 HELP2 HELP3
1 WHY1 D/A NA NA
2 WHY2 <NA> 2 NA
3 WHY3 1 NA NA
4 WHY4 2 NA NA
>
I would be so very very very grateful for any help with this. I have been on Stack Overflow for 5 hours now and am no closer to making this work :( I am not that hot with dplyr, so base R or anything else would be wonderful!
HELP2 and HELP1 have different classes, and ifelse also has trouble returning a vector of factor class. You could, however, do this without ifelse and without changing the classes of the columns:
data$HELP3 <- data$HELP1
inds <- (!is.na(data$HELP2)) & data$HELP1 == "D/A"
data$HELP3[inds] <- data$HELP2[inds]
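A small robustness tweak (my addition, not part of the original answer): if HELP1 can itself be NA, as in row 2 of the example, excluding those rows keeps inds free of NA values and so avoids NA subscripts in the assignment:
inds <- !is.na(data$HELP2) & !is.na(data$HELP1) & data$HELP1 == "D/A"  # NA in HELP1 counts as "no match"
data$HELP3[inds] <- data$HELP2[inds]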

How can I group based on similarity in strings

I have data like this:
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to take the longest name (in letters), see how many smaller, similar names there are, and assign them all to one group; then move on to the next-longest name and assign its matches to another group, and so on until no names are left.
First I calculate the length of each label so I have the lengths available:
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I should see which other strings share letters with this longest string. We have these possibilities:
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
all possible prefixes, but the order of the letters should stay the same (from left to right); for example it should be Afgh and cannot be fAhg
so we have only two other strings that are similar to this one
Afghanestan
Afghanestankabol
This is because they must match exactly, without even a single different letter (beyond what the longest string contains), to be assigned to the same group.
The desired output for this is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
Why is indiaAfghanestan a separate group? Because it does not completely belong to another name (it only partially matches one name or another); it would need to be part of a bigger name.
I tried the approach from Find similar strings and reconcile them within one dataframe, but it did not help me at all.
I found something else which may help:
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
  tolower %>%
  trimws %>%
  stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
  as.dist %>%
  `attr<-`("Labels", df$label) %>%
  hclust %T>%
  plot %T>%
  rect.hclust(h = 0.3) %>%
  cutree(h = 0.3) %>%
  print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
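For readers less comfortable with magrittr's %>% and %T>% pipes, the same steps can be written out explicitly (a sketch, assuming the stringdist package is installed):
library(stringdist)
labs <- trimws(tolower(as.character(df$label)))       # normalise the labels
d <- as.dist(stringdistmatrix(labs, method = "jw", p = 0.1))
attr(d, "Labels") <- as.character(df$label)           # keep the original spellings for the plot
hc <- hclust(d)
plot(hc)                                              # dendrogram
rect.hclust(hc, h = 0.3)                              # show where the tree is cut
df$group <- cutree(hc, h = 0.3)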

How to unwind column contents into new columns based on condition in another column in R

I have a dataframe called mydf. I want to split the contents of the ASM and GPM columns based on the format given in the FORMAT column. So basically, there will be as many new columns for ASM and for GPM as there are unique :-separated elements in the FORMAT column (i.e. 5 unique elements) to unwind in the result. Then I need to place the right value in the right column (.GT, .FT, and so on) as indicated by the FORMAT column.
mydf <- structure(list(`#CHROM` = c(1L, 1L, 1L), POS = c(10490L, 10493L,
10494L), FORMAT = c("GT:FT:GQ", "GT:PS:GL", "GT:PS:FT"), ASM = c("1/1:TRUE:4,2,333",
"./.:.:.", "0/1:.:VQLOW"), GPM = c("./.:.:.", "1/1:4:2,233",
"0/1:22:VQHIGH")), .Names = c("#CHROM", "POS", "FORMAT", "ASM",
"GPM"), class = "data.frame", row.names = c(NA, -3L))
result:
result <- structure(list(`#CHROM` = c(1L, 1L, 1L), POS = c(10490L, 10493L,
10494L), FORMAT = c("GT:FT:GQ", "GT:PS:GL", "GT:PS:FT"), ASM.GT = c("1/1",
"./.", "0/1"), ASM.FT = c("TRUE", NA, "VQLOW"), ASM.GQ = c("4,2,333",
NA, NA), ASM.PS = c(NA, NA, NA), ASM.GL = c(NA, NA, NA), GPM.GT = c("./.",
"1/1", "0/1"), GPM.FT = c(NA, NA, "VQHIGH"), GPM.GQ = c(NA, NA,
NA), GPM.PS = c(NA, 4L, 22L), GPM.GL = c(NA, 2233L, NA)), .Names = c("#CHROM",
"POS", "FORMAT", "ASM.GT", "ASM.FT", "ASM.GQ", "ASM.PS", "ASM.GL",
"GPM.GT", "GPM.FT", "GPM.GQ", "GPM.PS", "GPM.GL"), class = "data.frame", row.names = c(NA,
-3L))
Since it appears that the number of values in each of the columns to be split is the same, we can take advantage of the ability of dcast in "data.table" to handle multiple value.vars.
The splitting can be done by cSplit from my "splitstackshape" package.
library(splitstackshape)
dcast(cSplit(mydf, c("FORMAT", "ASM", "GPM"), ":", "long"),
`#CHROM` + POS ~ FORMAT, value.var = c("ASM", "GPM"))
# #CHROM POS ASM_FT ASM_GL ASM_GQ ASM_GT ASM_PS GPM_FT GPM_GL GPM_GQ GPM_GT GPM_PS
# 1: 1 10490 TRUE NA 4,2,333 1/1 NA . NA . ./. NA
# 2: 1 10493 NA . NA ./. . NA 2,233 NA 1/1 4
# 3: 1 10494 VQLOW NA NA 0/1 . VQHIGH NA NA 0/1 22
Note that "#CHROM" is a very R-unfriendly column name since the # is the comment character.
If you need to add back in the "FORMAT" column, add a [, FORMAT:= mydf$FORMAT][] to the end of the dcast above.
I'm presuming that you can handle further cleaning from here (for example, replacing . with NA and removing the thousands comma separator wherever it appears).
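The "." placeholders can then be turned into real NAs in one pass, for example (a sketch building on the dcast call above):
library(splitstackshape)   # for cSplit; dcast comes from data.table
out <- dcast(cSplit(mydf, c("FORMAT", "ASM", "GPM"), ":", "long"),
             `#CHROM` + POS ~ FORMAT, value.var = c("ASM", "GPM"))
out <- as.data.frame(out)
out[out == "."] <- NA      # treat the "." placeholders as missing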

Combining two data frames if values in one column fall between values in another

I imagine that there's some way to do this with sqldf, though I'm not familiar enough with the syntax of that package to get this to work. Here's the issue:
I have two data frames, each of which describes genomic regions and contains some other data. I want to combine the two if the region described in one df falls within a region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")
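As a starting point, here is a minimal base-R sketch (not from the original thread) that expands all pairs of rows and keeps those where either coordinate of g falls inside the corresponding l interval:
comb <- merge(g, l)   # no common column names, so this is a full cross join
hit  <- (comb$start_position >= comb$start & comb$start_position <= comb$end) |
        (comb$end_position   >= comb$start & comb$end_position   <= comb$end)
comb[hit, ]
For large genomic data, data.table::foverlaps() or the Bioconductor GenomicRanges package handle this kind of interval join far more efficiently.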

Frequency of items in a list in R

I have a very large CSV file. I want to calculate the frequency of the items in the second column in order to graph a histogram. An example of my data:
0010,10.1.1.16
0011,10.2.2.10
0012,192.168.2.61
0013,192.168.173.19
0014,10.2.2.10
0015,10.2.2.10
0016,192.168.2.61
I have used the below:
inFile <- read.csv("file.csv")
summary(inFile)
hist(inFile$secondCol)
output of summary:
X0010 X10.1.1.16
Min. :11.00 10.2.2.10 :3
1st Qu.:12.25 192.168.173.19:1
Median :13.50 192.168.2.61 :2
Mean :13.50
3rd Qu.:14.75
Max. :16.00
Because the file is very large, I'm not getting the right histogram. Any suggestions?
Just use table.
DF <- structure(list(V1 = 10:16, V2 = structure(c(1L, 2L, 4L, 3L, 2L,
2L, 4L), .Label = c("10.1.1.16", "10.2.2.10",
"192.168.173.19", "192.168.2.61"), class = "factor")),
.Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -7L))
table(DF$V2)
# 10.1.1.16 10.2.2.10 192.168.173.19 192.168.2.61
# 1 3 1 2
If you want a data.frame out of this, you can use as.data.frame:
as.data.frame(table(DF$V2))
# Var1 Freq
# 1 10.1.1.16 1
# 2 10.2.2.10 3
# 3 192.168.173.19 1
# 4 192.168.2.61 2
Since you say you want a histogram (with a discrete variable this is really a bar chart of counts), this can be done directly with ggplot2 without computing the counts first:
require(ggplot2)
ggplot(data = DF, aes(x = V2)) + geom_bar()  # geom_bar() counts discrete values; geom_histogram() expects a continuous x
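A base-R alternative, if ggplot2 is not available, is to plot the counts from table() directly (a one-line sketch):
barplot(table(DF$V2), las = 2)   # las = 2 rotates the axis labels so the IP addresses fit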
