Merging two columns into one based on value - r

I have a dataset with two columns containing the following: an indicator number and a hashcode
The only problem is that the columns have the same name, but the value can switch columns.
Now I want to merge the columns and keep the number (I don't care about the hashcode)
I saw this question: Merge two columns into one in r
and I tried the coalesce() function, but that is only for having NA values. Which I don't have. I looked at the unite function, but according to the cheat sheet documentation documentation here that doesn't what I'm looking for
My next try was the filter_at and other filter functions from the dplyr package Documentation here
But that only leaves 150 data points while at the start I have 61k data points.
Code of filter_at I tried:
data <- filter_at(data,vars("hk","hk_1"),all_vars(.>0))
I assumed that a #-string shall not be greater than 0, which seems to be true, but it removes more than intented.
I would like to keep hk or hk_1 value which is a number. The other one (the hash) can be removed. Then I want a new column which only contains those numbers.
Sample data
My data looks like this:
HK|HK1
190|#SP0839
190|#SP0340
178|#SP2949
#SP8390|177
#SP2240|212
What I would like to see:
HK
190
190
178
177
212
I hope this provides an insight into the data. There are more columns like description, etc which makes that 190 at the start are not doubles.

We can replace all the values that start with "#" to NA and then use coalesce to select non-NA value between HK and HK1.
library(dplyr)
df %>%
mutate_all(~as.character(replace(., grepl("^#", .), NA))) %>%
mutate(HK = coalesce(HK, HK1)) %>%
select(HK)
# HK
#1 190
#2 190
#3 178
#4 177
#5 212
data
df <- structure(list(HK = structure(c(4L, 4L, 3L, 2L, 1L), .Label = c("#SP2240",
"#SP8390", "178", "190"), class = "factor"), HK1 = structure(c(2L,
1L, 3L, 4L, 5L), .Label = c("#SP0340", "#SP0839", "#SP2949",
"177", "212"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))

Related

Delete rows using conditional [duplicate]

This question already has answers here:
Regular expressions (RegEx) and dplyr::filter()
(2 answers)
Closed 4 years ago.
I have a data.frame like this:
Client Product
1 VV_Brazil_Jul
2 VV_Brazil_Mar
5 VV_US_Jul
1 VV_JP_Apr
3 VV_CH_May
6 VV_Brazil_Aug
I would like to delete all rows with "Brazil".
You can do this using the grepl function and the ! to find the cases that are not matched:
# Create a dataframe where some cases have the product with Brazil as part of the value
df <- structure(list(Client = c(1L, 2L, 5L, 1L, 3L, 6L),
Product = c("VV_Brazil_Jul", "VV_Brazil_Mar", "VV_US_Jul", "VV_JP_Apr", "VV_CH_May", "VV_Brazil_Aug")),
row.names = c(NA, -6L), class = c("data.table", "data.frame"))
# Display the original dataframe in the Console
df
# Limit the dataframe to cases which do not have Brazil as part of the product
df <- df[!grepl("Brazil", df$Product, ignore.case = TRUE),]
# Display the revised dataframe in the Console
df
You can do the same thing with the tidyverse collection
dplyr::slice(df, -stringr::str_which(df$Product, "Brazil"))

Find character in csv, split cells in R

I have a .csv file with gene names such as "AT1G45150". However, some entries have two gene names connected by an underscore, so they look like this "AT3G01311_ATCG00940" as seen in line 135. Is there a simple command, perhaps with something like gsub that not only finds and eliminates everything in the cell from the underscore on, but also sticks the second gene name in a cell immediately below the one it was found in, in the same column but the next row down? Also want to keep everything that was already in that column, just extend column length to add new members.
"133","AT1G45150","AT1G12200","AT2G25370","AT1G19715","AT2G46830","AT1G20870","AT4G12400","AT1G19660"
"134","AT1G47280","AT1G12410","AT2G26920","AT1G19750","AT2G46850","AT1G21400","AT4G15430","AT1G19690"
"135","AT1G47317","AT1G12530","AT2G27270","AT1G20540","AT3G01311_ATCG00940","AT1G21450","AT5G01970","AT1G19750"
"136","AT1G47420","AT1G12550","AT2G28590","AT1G20570","AT3G03470","AT1G21730","AT1G20800","AT1G19780"
"137","AT1G47500","AT1G12740","AT2G28970","AT1G20580","AT3G03980","AT1G21760","AT3G54740","AT1G19790"
"138","AT1G47570","AT1G12750","AT2G29740","AT1G20610","AT3G05040","AT1G22000","AT4G12400","AT1G19970"
so that it becomes
"133","AT1G45150","AT1G12200","AT2G25370","AT1G19715","AT2G46830","AT1G20870","AT4G12400","AT1G19660"
"134","AT1G47280","AT1G12410","AT2G26920","AT1G19750","AT2G46850","AT1G21400","AT4G15430","AT1G19690"
"135","AT1G47317","AT1G12530","AT2G27270","AT1G20540","AT3G01311","AT1G21450","AT5G01970","AT1G19750"
"136","AT1G47420","AT1G12550","AT2G28590","AT1G20570","ATCG000940","AT1G21730","AT1G20800","AT1G19780"
"137","AT1G47500","AT1G12740","AT2G28970","AT1G20580","AT3G03470","AT1G21760","AT3G54740","AT1G19790"
"138","AT1G47570","AT1G12750","AT2G29740","AT1G20610","AT3G03980","AT1G22000","AT4G12400","AT1G19970"
Thanks for your help!
edit: trying to provide a reproducible example, hope this is helpful:
> dput(droplevels(genes[133:138,]))
structure(list(g99 = structure(1:6, .Label = c("AT1G45150", "AT1G47280",
"AT1G47317", "AT1G47420", "AT1G47500", "AT1G47570"), class = "factor"),
g95 = structure(1:6, .Label = c("AT1G12200", "AT1G12410",
"AT1G12530", "AT1G12550", "AT1G12740", "AT1G12750"), class = "factor"),
y99 = structure(1:6, .Label = c("AT2G25370", "AT2G26920",
"AT2G27270", "AT2G28590", "AT2G28970", "AT2G29740"), class = "factor"),
y95 = structure(1:6, .Label = c("AT1G19715", "AT1G19750",
"AT1G20540", "AT1G20570", "AT1G20580", "AT1G20610"), class = "factor"),
a99 = structure(1:6, .Label = c("AT2G46830", "AT2G46850",
"AT3G01311_ATCG00940", "AT3G03470", "AT3G03980", "AT3G05040"
), class = "factor"), a95 = structure(1:6, .Label = c("AT1G20870",
"AT1G21400", "AT1G21450", "AT1G21730", "AT1G21760", "AT1G22000"
), class = "factor"), e99 = structure(c(3L, 4L, 5L, 1L, 2L,
3L), .Label = c("AT1G20800", "AT3G54740", "AT4G12400", "AT4G15430",
"AT5G01970"), class = "factor"), e95 = structure(1:6, .Label = c("AT1G19660",
"AT1G19690", "AT1G19750", "AT1G19780", "AT1G19790", "AT1G19970"
), class = "factor")), .Names = c("g99", "g95", "y99", "y95",
"a99", "a95", "e99", "e95"), row.names = 133:138, class = "data.frame")
I'm assuming that these genes are part of a bigger data frame with more information about each gene. I'd use tidyr and dplyr. Something like this should work:
library(dplyr)
library(tidyr)
df <-
df %>%
separate(gene, c('first', 'second'), '_') %>% # Make two columns
gather(position, gene, first, second) %>%
filter(!is.na(gene))
I used separate to split the column into two, with the first column containing the first gene and the second column with the second (if it exists). Then I used gather to stack all the genes on top of each other and filter to remove rows from the missing second gene.
Hope this helps!
Now that I've seen your data I've got a new answer. I'm a little confused about what exactly you want in the dataframe, but here's how to do it for a single vector.
library(stringr)
> df$a99
[1] "AT2G46830" "AT2G46850" "AT3G01311_ATCG00940"
[4] "AT3G03470" "AT3G03980" "AT3G05040"
> unlist(str_split(df$a99, '_'))
[1] "AT2G46830" "AT2G46850" "AT3G01311" "ATCG00940" "AT3G03470" "AT3G03980"
[7] "AT3G05040"
This answer assumes you maybe want to keep the data frame structure.
First load the following three packages:
library(stringr); library(purrr); library(dplyr)
Then your data frame looks like:
> genes
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 133 AT1G45150 AT1G12200 AT2G25370 AT1G19715 AT2G46830 AT1G20870 AT4G12400 AT1G19660
2 134 AT1G47280 AT1G12410 AT2G26920 AT1G19750 AT2G46850 AT1G21400 AT4G15430 AT1G19690
3 135 AT1G47317 AT1G12530 AT2G27270 AT1G20540 AT3G01311_ATCG00940 AT1G21450 AT5G01970 AT1G19750
4 136 AT1G47420 AT1G12550 AT2G28590 AT1G20570 AT3G03470 AT1G21730 AT1G20800 AT1G19780
5 137 AT1G47500 AT1G12740 AT2G28970 AT1G20580 AT3G03980 AT1G21760 AT3G54740 AT1G19790
6 138 AT1G47570 AT1G12750 AT2G29740 AT1G20610 AT3G05040 AT1G22000 AT4G12400 AT1G19970
If I was just to attack the V6 variable, I would use the following commands from stringr:
> str_sub(genes$V6, start = 1L,
end = ifelse(is.na(str_locate(genes$V6, '_')[,1]), -1,
str_locate(genes$V6, '_')[, 1] - 1))
[1] "AT2G46830" "AT2G46850" "AT3G01311" "AT3G03470" "AT3G03980" "AT3G05040"
But we want to generalize this to all the variables, in case you want to keep your data frame structure. So use the map function from purrr to go through all the columns in the data frame (you also might be able use lapply in a similar manner, but sometimes it's hard to coerce to a dataframe).
> genes2 <- map(genes, function(x) { str_sub(x, start = 1L,
end = ifelse(is.na(str_locate(x, '_'))[,1], -1,
str_locate(x, '_')[,1] - 1)) })
%>% as_data_frame()
And your data frame then looks like this:
> genes2
Source: local data frame [6 x 9]
V1 V2 V3 V4 V5 V6 V7 V8 V9
(chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
1 133 AT1G45150 AT1G12200 AT2G25370 AT1G19715 AT2G46830 AT1G20870 AT4G12400 AT1G19660
2 134 AT1G47280 AT1G12410 AT2G26920 AT1G19750 AT2G46850 AT1G21400 AT4G15430 AT1G19690
3 135 AT1G47317 AT1G12530 AT2G27270 AT1G20540 AT3G01311 AT1G21450 AT5G01970 AT1G19750
4 136 AT1G47420 AT1G12550 AT2G28590 AT1G20570 AT3G03470 AT1G21730 AT1G20800 AT1G19780
5 137 AT1G47500 AT1G12740 AT2G28970 AT1G20580 AT3G03980 AT1G21760 AT3G54740 AT1G19790
6 138 AT1G47570 AT1G12750 AT2G29740 AT1G20610 AT3G05040 AT1G22000 AT4G12400 AT1G19970

How do I plot boxplots of two different series?

I have 2 dataframe sharing the same rows IDs but with different columns
Here is an example
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure where I have in x axis the chrom and coord combined (so 3 points) and for each x value 2 boxplots side by side corresponding to the two dataframes ?
What is the best way of doing this ? Should I merge the two dataframes together somehow in order to get only one and loop over the boxplots rendering by 3 columns ?
Any idea on how this can be done ?
The problem is that the two dataframes have the same number of rows but can differ in number of columns
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the dataframe in order to get the same number of column but got lost on how to this properly
Thanks in advance
UPDATE
This is what I tried to do
I merged chrom and coord columns together to create a single ID
I used reshape t melt the dataframes
I merged the 2 melted dataframe into a single one
the head looks like this
I have two variable A2 and A4 corresponding to the 2 dataframes
then I created a boxplot such using this
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
Then I would combine and reshape the data to make it easier to plot. Here's what i'd do
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
dd<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(dd[order(dd[[1]],dd[[2]]),1:2]), sep=":")))
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chr/coord pairs sorted in the correct order.
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like

Combing two data frames if values in one column fall between values in another

I imagine that there's some way to do this with sqldf, though I'm not familiar with the syntax of that package enough to get this to work. Here's the issue:
I have two data frames, each of which describe genomic regions and contain some other data. I have to combine the two if the region described in the one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")

split dataset by day and save it as data frame

I have a dataset with 2 months of data (month of Feb and March). Can I know how can I split the data into 59 subsets of data by day and save it as data frame (28 days for Feb and 31 days for Mar)? Preferably to save the data frame in different name according to the date, i.e. 20140201, 20140202 and so forth.
df <- structure(list(text = structure(c(4L, 6L, 5L, 2L, 8L, 1L), .Label = c(" Terpilih Jadi Maskapai dengan Pelayanan Kabin Pesawat cont",
"booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids",
"Can I change for the traveler details because i choose wrongly for the Mr or Ms part",
"cant do it with cards either", "Coming back home AK", "gotta try PNNL",
"Jadwal penerbangan medanjktsblm tangalmasi ada kah", "Me and my Tart would love to flyLoveisintheAir",
"my flight to Bangkok onhas been rescheduled I couldnt perform seat selection now",
"Pls checks his case as money is not credited to my bank acctThanks\n\nCASLTP",
"Processing fee Whatt", "Tacloban bound aboardto get them boats Boats boats boats Tacloban HeartWork",
"thanks I chatted with ask twice last week and told the same thing"
), class = "factor"), created = structure(c(1L, 1L, 2L, 2L, 3L,
3L), .Label = c("1/2/2014", "2/2/2014", "5/2/2014", "6/2/2014"
), class = "factor")), .Names = c("text", "created"), row.names = c(NA,
6L), class = "data.frame")
You don't need to output multiple dataframes. You only need to select/subset them by year&month of the 'created' field. So here are two ways do do that: 1. is simpler if you don't plan on needing any more date-arithmetic
# 1. Leave 'created' a string, just use text substitution to extract its month&date components
df$created_mthyr <- gsub( '([0-9]+/)[0-9]+/([0-9]+)', '\\1\\2', df$created )
# 2. If you need to do arbitrary Date arithmetic, convert 'created' field to Date object
# in this case you need an explicit format-string
df$created <- as.Date(df$created, '%M/%d/%Y')
# Now you can do either a) split
split(df, df$created_mthyr)
# specifically if you want to assign the output it creates to 3 dataframes:
df1 <- split(df, df$created_mthyr)[[1]]
df2 <- split(df, df$created_mthyr)[[2]]
df5 <- split(df, df$created_mthyr)[[3]]
# ...or else b) do a Split-Apply-Combine and perform arbitrary command on each separate subset. This is very powerful. See plyr/ddply documentation for examples.
require(plyr)
df1 <- dlply(df, .(created_mthyr))[[1]]
df2 <- dlply(df, .(created_mthyr))[[2]]
df5 <- dlply(df, .(created_mthyr))[[3]]
# output looks like this - strictly you might not want to keep 'created','created_mthyr':
> df1
# text created created_mthyr
#1 cant do it with cards either 1/2/2014 1/2014
#2 gotta try PNNL 1/2/2014 1/2014
> df2
#3
#Coming back home AK
#4 booking number ZEPLTQ I want to cancel their flight because they can not together with my wife and kids
# created created_mthyr
#3 2/2/2014 2/2014
#4 2/2/2014 2/2014

Resources