Copy value into NA from one column using criteria from another column - r

I'm new to R and I've been struggling with the tidyverse.
I have created the following data.frame as example. My original data.frame has 180000 obs and 34 vars.
name <- c("chem1", "chem2", "chem3", "chem4", "chem5")
cas <- c("29331-92-5", "29331-92-6", NA, "29331-92-4", "29331-92-1" )
tib <- tibble(name, cas)
which generates this:
tib
# A tibble: 5 x 2
name cas
<chr> <chr>
1 chem1 29331-92-5
2 chem2 29331-92-6
3 chem3 NA
4 chem4 29331-92-4
5 chem5 29331-92-1
chem3 and chem1 must have the same cas value; however, the input file came with an NA value for chem3.
I do not know how to copy the cas value belonging to chem1, that is "29331-92-5", into the NA cell.
I've been trying with the tidyverse, but I am happy to receive base R suggestions as well.

What I understood: chem3 is supposed to have the same value as chem1, but your input file came with an error, because in many rows the values for chem3 differ from those for chem1.
To correct this, I would build a lookup vector holding the correct value for each "chem". Once I have made sure that all the values in this lookup vector are correct, I would just apply it across your tib data.frame. To do this, I first extract the current unique values of each "chem" as follows:
library(tidyverse)
group <- tib %>%
  group_by(name, cas) %>%
  summarise(
    count = n()
  )
After that, I turn these unique cas values into a vector and then name each value after its respective chem. Since the value for "chem3" is incorrect, I need to set it equal to the value for "chem1" before I proceed.
levels <- group$cas
names(levels) <- group$name
levels["chem3"] <- levels["chem1"]
Now that the value of "chem3" is corrected in the lookup vector levels, I ask R to repeat the values of levels in the same order as the names appear in your tib data.frame, and then save the result in a new column through the mutate() function.
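tib <- tib %>%
  mutate(
    # look up each row's name in the corrected vector;
    # unname() drops the names carried over from the lookup vector
    correct_cas = unname(levels[name])
  )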
This results in:
# A tibble: 5 x 3
name cas correct_cas
<chr> <chr> <chr>
1 chem1 29331-92-5 29331-92-5
2 chem2 29331-92-6 29331-92-6
3 chem3 NA 29331-92-5
4 chem4 29331-92-4 29331-92-4
5 chem5 29331-92-1 29331-92-1
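If the only requirement is to copy the cas value of chem1 into the missing entries of chem3, a direct assignment may be enough. Below is a minimal base R sketch under that assumption. The helper name fill_from and its arguments target and donor are hypothetical (they are not part of the answer above), and a plain data.frame stands in for the tibble.

```r
# Example data from the question, as a plain data.frame
tib <- data.frame(
  name = c("chem1", "chem2", "chem3", "chem4", "chem5"),
  cas  = c("29331-92-5", "29331-92-6", NA, "29331-92-4", "29331-92-1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill the missing `cas` values of `target` with the
# first non-missing `cas` value recorded for `donor`
fill_from <- function(df, target, donor) {
  donor_val <- df$cas[df$name == donor & !is.na(df$cas)][1]
  to_fill   <- df$name == target & is.na(df$cas)
  df$cas[to_fill] <- donor_val
  df
}

tib <- fill_from(tib, target = "chem3", donor = "chem1")
```

Only rows where cas is NA are touched, so existing chem3 values are left alone. If those values should also be overwritten, the lookup-vector approach above is the better fit.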

Related

How to return the range of values shared between two data frames in R?

I have several data frames that share the same column names: ID, from and to (the start and end of a range), and a group label for each of them.
What I want is to find which values of from and to in one data frame fall within the ranges of the other one. I leave an example picture to illustrate what I want to achieve (no graph is needed for the moment).
I thought I could accomplish this using between() from the dplyr package, but I could not. The idea would be: if between() returns TRUE, return the maximum value of from and the minimum value of to between the data frames.
I leave example data frames and the results I want to obtain.
a <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(1,500,1000,1,500,1000,1,500,1000),
to=c(400,900,1400,400,900,1400,400,900,1400),group=rep("a",9))
b <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),from=c(300,1200,1900,1400,2800,3700,1300,2500,3500),
to=c(500,1500,2000,2500,3000,3900,1400,2800,3900),group=rep("b",9))
results <- data.frame(ID = c(1,1,1,2,3),from=c(300,500,1200,1400,1300),
to=c(400,500,1400,1400,1400),group=rep("a, b",5))
I tried using this function, which returns the values when there is a match, but it doesn't return the range shared between them:
f <- function(vec, id) {
  if(length(.x <- which(vec >= a$from & vec <= a$to & id == a$ID))) .x else NA
}
b$fromA <- a$from[mapply(f, b$from, b$ID)]
b$toA <- a$to[mapply(f, b$to, b$ID)]
We can use the fact that the start and end points are in different columns and that the ranges within the same group (a or b) do not overlap. This is my solution. For clarity, I have called the mutated 'from' and 'to' columns 'point_1' and 'point_2'.
You can bind the two data frames and compare the from column with the previous row's to value, lag(to), to see whether the current range starts before the previous one ends. You also compare lag(to) with the current to column to find where the shared range ends.
Important: these operations do not distinguish whether the two rows being compared come from the same group (a or b). Filtering out the NAs in point_1 (the new mutated 'from' column) removes the wrongly mutated values.
Also, note that I assume that, for example, a range in 'a' cannot overlap two rows in 'b'. That doesn't happen in your 'results' table, but you should check it in your data frames.
library(dplyr)

res = rbind(a,b) %>% # Bind by rows
  arrange(ID,from) %>% # arrange by ID and starting point (from)
  group_by(ID) %>% # perform the following operations grouped by IDs
  # Here is the trick. If the ranges for the same ID and group (i.e. 1,a) do
  # not overlap, when you mutate the following cols the result will be NA for
  # point_1.
  mutate(point_1 = ifelse(from <= lag(to), from, NA),
         point_2 = ifelse(lag(to)>=to, to, lag(to)),
         groups = paste(lag(group), group, sep = ',')) %>%
  filter(! is.na(point_1)) %>% # remove NAs in from
  select(ID,point_1, point_2, groups) # get the result dataframe
If you play a bit with the code, leaving out the filter() and select() steps, you will see how it works.
> res
# A tibble: 5 x 4
# Groups: ID [3]
ID point_1 point_2 groups
<dbl> <dbl> <dbl> <chr>
1 1 300 400 a,b
2 1 500 500 b,a
3 1 1200 1400 a,b
4 2 1400 1400 a,b
5 3 1300 1400 a,b
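As a cross-check on the result above, the shared ranges can also be computed by pairing every range in a with every range in b that has the same ID and intersecting each pair. This is a base R sketch, not the method used in the answer. It assumes closed intervals and treats a single shared point (such as 500 to 500) as an overlap, as the expected results do. The names pairs and shared are introduced here for illustration.

```r
a <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),
                from = c(1,500,1000,1,500,1000,1,500,1000),
                to = c(400,900,1400,400,900,1400,400,900,1400),
                group = rep("a", 9))
b <- data.frame(ID = c(1,1,1,2,2,2,3,3,3),
                from = c(300,1200,1900,1400,2800,3700,1300,2500,3500),
                to = c(500,1500,2000,2500,3000,3900,1400,2800,3900),
                group = rep("b", 9))

# Pair every range in `a` with every range in `b` sharing the same ID
pairs <- merge(a, b, by = "ID", suffixes = c(".a", ".b"))

# The intersection of two intervals runs from the larger start
# to the smaller end
pairs$from <- pmax(pairs$from.a, pairs$from.b)
pairs$to   <- pmin(pairs$to.a, pairs$to.b)

# Keep only the pairs that actually intersect
shared <- pairs[pairs$from <= pairs$to, c("ID", "from", "to")]
shared <- shared[order(shared$ID, shared$from), ]
rownames(shared) <- NULL
```

Because every pair within an ID is compared, this also covers the case where one range in a overlaps two rows in b. The cost is a larger intermediate table, which may matter for big inputs.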

Find nearest entry between dataframes in R

I have two dataframes that I want to compare. I already know that the values in dataframe one (df1$BP) are not within the range of values in dataframe two (df2$START and df2$STOP), but I want to return the row in dataframe two where df2$START or df2$STOP is closest in value to df1$BP, where the "Chr" column matches between datasets (df1$Chr, df2$Chr).
I have managed to do this (see bottom of Q), but it seems SERIOUSLY unwieldy, and I wondered if there was a more parsimonious manner of achieving the same thing.
So for dataframe one (df1), we have:
df1=data.frame(SNP=c("rs79247094","rs13325007"),
Chr=c(2,3),
BP=c(48554955,107916058))
df1
SNP Chr BP
rs79247094 2 48554955
rs13325007 3 107916058
For dataframe two, we have:
df2=data.frame(clump=c(1,2,3,4,5,6,7,8),
Chr=c(2,2,2,2,3,3,3,3),
START=c(28033538,37576136,58143438,60389362,80814042,107379837,136288405,161777035),
STOP=c(27451538,36998607,57845065,60242162,79814042,107118837,135530405,161092491))
df2
Clump Chr START STOP
1 2 28033538 27451538
2 2 37576136 36998607
3 2 58143438 57845065
4 2 60389362 60242162
5 3 80814042 79814042
6 3 107379837 107118837
7 3 136288405 135530405
8 3 161777035 161092491
I am interested in returning which START/STOP values are closest to BP. Ideally, I could return the row, and what the difference between BP and START or STOP is (df3$Dist), like:
df3
Clump Chr START STOP SNP BP Dist
3 2 58143438 57845065 rs79247094 48554955 9290110
6 3 107379837 107118837 rs13325007 107916058 536221
I have found similar questions, for example: Return rows establishing a "closest value to" in R
But these are finding the closest values based on a fixed value, rather than a value that changes (and matching on the Chr column).
My long winded method is:
df3<-right_join(df1,df2,by="Chr")
to give me all of the combinations of df1 and df2 together.
df3$start_dist<-abs(df3$START-df3$BP)
to create a column with the absolute difference between START and BP
df3$stop_dist<-abs(df3$STOP-df3$BP)
to create a column with the absolute difference between STOP and BP
df3$dist.compare<-ifelse(df3$start_dist<df3$stop_dist,df3$start_dist,df3$stop_dist)
df3<-df3[with(df3,order(SNP,dist.compare)),]
to create a column (dist.compare) which holds the smallest difference between BP and START or STOP (and then re-order by that column within each SNP)
df3<- df3 %>% group_by(SNP) %>% mutate(Dist = first(dist.compare))
to create a column (Dist) which prints the minimum value from df3$dist.compare
df3<-df3[which(df3$dist.compare==df3$Dist),c("clump","Chr","START","STOP","SNP","BP","Dist")]
df3<-df3[order(df3$clump),]
to only print rows where dist.compare matches Dist (so the minimum value), and drop the intermediate columns, and tidy up by re-ordering by clump. Now that gets me to where I want to be:
df3
Clump Chr START STOP SNP BP Dist
3 2 58143438 57845065 rs79247094 48554955 9290110
6 3 107379837 107118837 rs13325007 107916058 536221
But I feel like it is very very convoluted, and wondered if anyone had any tips on how to refine that process?
thanks in advance
Following the logic you've laid out in your syntax, here is a dplyr solution that is a bit cleaner:
right_join your dataframes
Create a variable dist.compare based on absolute values
Group by SNP
Filter to keep the smallest distance
Select variables in the order you'd like for your final dataframe. Note you can rename variables in a dplyr::select statement (Dist = dist.compare)
Order values by clump
library(dplyr)
df3 <- right_join(df1, df2, by = "Chr") %>%
  mutate(dist.compare = ifelse(abs(START - BP) < abs(STOP - BP), abs(START - BP), abs(STOP - BP))) %>%
  group_by(SNP) %>%
  filter(dist.compare == min(dist.compare)) %>%
  select(clump, Chr, START, STOP, SNP, BP, Dist = dist.compare) %>%
  arrange(clump)
This gives us:
clump Chr START STOP SNP BP Dist
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 3 2 58143438 57845065 rs79247094 48554955 9290110
2 6 3 107379837 107118837 rs13325007 107916058 536221
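The ifelse() comparison of the two absolute differences is an element-wise minimum, which pmin() expresses more directly. Here is a base R sketch of the same steps, assuming the df1 and df2 from the question. The names m and nearest are introduced here for illustration.

```r
df1 <- data.frame(SNP = c("rs79247094", "rs13325007"),
                  Chr = c(2, 3),
                  BP = c(48554955, 107916058))
df2 <- data.frame(clump = c(1,2,3,4,5,6,7,8),
                  Chr = c(2,2,2,2,3,3,3,3),
                  START = c(28033538,37576136,58143438,60389362,
                            80814042,107379837,136288405,161777035),
                  STOP = c(27451538,36998607,57845065,60242162,
                           79814042,107118837,135530405,161092491))

# All combinations of SNPs and clumps on the same chromosome
m <- merge(df1, df2, by = "Chr")

# Element-wise minimum of the distance to START and the distance to STOP
m$Dist <- pmin(abs(m$START - m$BP), abs(m$STOP - m$BP))

# Keep, for each SNP, the row(s) whose distance equals that SNP's minimum
nearest <- m[m$Dist == ave(m$Dist, m$SNP, FUN = min), ]
nearest <- nearest[order(nearest$clump),
                   c("clump", "Chr", "START", "STOP", "SNP", "BP", "Dist")]
```

If two clumps are equally close to a SNP, both rows are kept, which matches the behaviour of the filter() step above.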

which pattern from string another string matches

I have a column variable that I want to split into three factor variables. These are the factor variables I want to create:
goal<-c('newref', 'meow', 'woof')
area<-c('eco', 'social', 'bank')
fr<-c('demo', 'hist', 'util')
And the current variable looks more or less like that:
code<-c('goal\\\\meow', 'area\\\\bank', 'area\\\\bank', 'fr\\\\utilitarian', 'fr\\\\history')
And let's say the dataframe is something like that
df<-data.frame(var1=c(1,2,3,4,5), var2=c('a', 'b', 'c', 'd', 'e'), code=code)
So I would like to create 3 new columns, one per each factor variable, and use a regular expression that detected what it belongs to. So for example row number one should look as follows:
row1<-data.frame(var1=1, var2=c('a'), code=c('goal\\\\meow'), goal=2, area=NA, fr=NA)
Also note that the value of the factor variables is an abbreviation of the value in code (eg, history / hist).
The database is likely to have 10000 entries, so I would really appreciate any hints on this.
Thank you!
We can define a function that finds the position of the factor variable that, when used as a regular expression, finds a match in the code column:
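find_match <- function(code, matches) {
  # sapply() builds a logical matrix (one row per code, one column per pattern);
  # match() then returns the position of the first TRUE in each row
  apply(sapply(matches, grepl, code), 1, match, x = TRUE)
}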
If there is no match, this function returns NA for that row.
Next, we can simply use mutate from dplyr to add each column of factors:
df %>% mutate(goal = find_match(code, goal),
              area = find_match(code, area),
              fr = find_match(code, fr))
Which gives:
var1 var2 code goal area fr
1 1 a goal\\\\meow 2 NA NA
2 2 b area\\\\bank NA 3 NA
3 3 c area\\\\bank NA 3 NA
4 4 d fr\\\\utilitarian NA NA 3
5 5 e fr\\\\history NA NA 2
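One thing to watch with find_match(): when code has a single element, sapply() returns a plain vector instead of a matrix, and apply() then fails. Below is a sketch of a variant that always builds a matrix first. The name find_match_safe is hypothetical, and the example codes use a forward slash as the separator purely to keep the strings easy to read.

```r
fr <- c("demo", "hist", "util")

# Hypothetical variant of find_match() that also works for a single code
find_match_safe <- function(code, patterns) {
  # One logical vector per pattern, each with one entry per code
  hits <- vapply(patterns, function(p) grepl(p, code),
                 logical(length(code)))
  # Force a matrix with one row per code, even when there is only one code
  hits <- matrix(hits, nrow = length(code))
  # Position of the first matching pattern in each row, NA if none match
  apply(hits, 1, function(row) match(TRUE, row))
}

codes  <- c("goal/meow", "area/bank", "fr/utilitarian", "fr/history")
many   <- find_match_safe(codes, fr)
single <- find_match_safe("fr/history", fr)
```

Here many holds one position per code (NA where no pattern in fr matches), and single is the position of "hist" within fr.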
Doing this with tidyverse tools like the pipe %>%, dplyr and tidyr:
separate() breaks up the code column into two, using the separator you specify.
Because "\" is a special character in regex, you have to escape each \ you want to look for with another \.
spread() converts it from tall form to wide form, as you needed.
library(dplyr)
library(tidyr) # separate() and spread() come from tidyr
df %>%
  separate(code, into = c("colName", "value"), sep = "\\\\\\\\") %>%
  spread(colName, value)

using variable column names in dplyr summarise

I found this question already asked but without proper answer. R using variable column names in summarise function in dplyr
I want to calculate the difference between two column means, but the column name should be provided by variables... So far I found only the function as.name to provide column names as text, but this somehow doesn't work here...
With fixed column names it works.
x <- c('a','b')
df <- group_by(data.frame(a=c(1,2,3,4), b=c(2,3,4,5), c=c(1,1,2,2)), c)
df %>% summarise(mean(a) - mean(b))
With variable columns, it doesn't work
df %>% summarise(mean(x[1]) - mean(x[2]))
df %>% summarise(mean(as.name(x[1])) - mean(as.name(x[2])))
Since this was asked already 3 years ago and dplyr is under good development, I am wondering if there is an answer to this now.
You can use base::get:
df %>% summarise(mean(get(x[1])) - mean(get(x[2])))
# # A tibble: 2 x 2
# c `mean(a) - mean(b)`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1
get will search in current environment by default.
As the error message says, mean expects a logical or numeric object, as.name returns a name:
class(as.name("a")) # [1] "name"
You could evaluate your name, that would work as well :
df %>% summarise(mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2]))))
# # A tibble: 2 x 2
# c `mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2])))`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1
This is not a direct answer to your question but maybe could be useful for other people reading your post:
It could be easier to use variable columns directly, like
df %>% summarise(someName = mean(.[[1]]) - mean(.[[2]]))
############ which is the same as ############
df %>% summarise(someName = mean(.[,1,drop=T]) - mean(.[,2,drop=T]))
Note that drop=T is needed because single-bracket indexing otherwise preserves the class of the object (in this case class( . ) = data.frame), which isn't what we want: columns must be passed to summarise as vectors.
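For readers on a more recent dplyr, the .data pronoun offers another way to refer to a column whose name is stored in a character vector. This is a minimal sketch, assuming a dplyr version that supports .data (it is not taken from the answers above). The column name diff is chosen here for illustration.

```r
library(dplyr)

x <- c("a", "b")
df <- group_by(data.frame(a = c(1, 2, 3, 4),
                          b = c(2, 3, 4, 5),
                          c = c(1, 1, 2, 2)), c)

# .data[[name]] looks the column up by its name within the current group
res <- df %>%
  summarise(diff = mean(.data[[x[1]]]) - mean(.data[[x[2]]]))
```

Unlike .[[1]], which refers to the whole data frame passed through the pipe, .data[[...]] is evaluated per group, so the means are group means.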

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
df <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
df <- subset( df, select = -a )
and to remove the b and d columns you could do
df <- subset( df, select = -c(d, b ) )
You can remove all columns between d and b with:
df <- subset( df, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = F]
Including drop = F ensures that the result will still be a data.frame even if only one column remains.
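A close variant, sketched here with a small made-up data frame, is to compute the names to keep with setdiff() and index with a single bracket and no comma. List-style indexing of a data.frame always returns a data.frame, so drop is not needed, and names that are not present are simply ignored.

```r
# Small stand-in for the data in the question (values are illustrative)
data <- data.frame(chr = rep("chr1", 3),
                   genome = rep("hg19_refGene", 3),
                   region = c("CDS", "exon", "CDS"),
                   stringsAsFactors = FALSE)

# "not_a_column" is deliberately absent to show that it does no harm
cols.dont.want <- c("genome", "not_a_column")

# Names to keep: everything that is not in cols.dont.want
keep <- setdiff(names(data), cols.dont.want)

# Single-bracket, no-comma indexing keeps the result a data.frame
data <- data[keep]
```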
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
  select(-height) %>%                  # a specific column name
  select(-one_of('mass', 'films')) %>% # any columns named in one_of()
  select(-(name:hair_color)) %>%       # the range of columns from 'name' to 'hair_color'
  select(-contains('color')) %>%       # any column name that contains 'color'
  select(-starts_with('bi')) %>%       # any column name that starts with 'bi'
  select(-ends_with('er')) %>%         # any column name that ends with 'er'
  select(-matches('^v.+s$')) %>%       # any column name matching the regex pattern
  select_if(~!is.list(.)) %>%          # not by column name but by data type
  head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store the result in another variable.
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
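A sketch of what a pattern-based removal might look like, using grepl() on the column names. The data frame and the "tmp_" prefix are made up for illustration. The last lines show a pitfall of the -which() form: when no column matches, which() returns an empty vector and the negative index selects no columns at all.

```r
df <- data.frame(id = 1:3, tmp_a = 4:6, tmp_b = 7:9, value = 10:12)

# Logical vector: TRUE for every column whose name starts with "tmp_"
is_tmp <- grepl("^tmp_", names(df))

# Keep the columns that do NOT match; drop = FALSE keeps a data.frame
# even if only one column is left
trimmed <- df[, !is_tmp, drop = FALSE]

# Pitfall: no column is called "removeCol", so which() returns integer(0)
# and the result has zero columns instead of all of them
empty <- df[, -which(names(df) == "removeCol")]
```

The logical form with ! does not have this problem: when nothing matches, every column is kept.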

Resources