Merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell of dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (a value of dat2$VAR falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?

Here is a solution that is quite short but not particularly efficient, because it expands every range into one row per position, so I would not recommend it for large data. However, you seemed to indicate your data was not that large, so give it a try and let me know:
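library(plyr)
# expand each range in dat$Pos into one row per position
exploded.dat <- adply(dat, 1, function(x) {
  parts <- strsplit(x$Pos, ":")[[1]]
  chr <- parts[1]
  range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
  # convert the range limits to numbers before building the sequence
  start <- as.numeric(range[1])
  end <- as.numeric(range[2])
  data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
# join the exploded positions to dat2
merge(dat2, exploded.dat, by = "VAR")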
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
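One way to avoid expanding every range is to split the position strings into columns, join on chromosome, and then filter on the range. A minimal sketch in base R, assuming made-up rows (`ranges` and `variants` are hypothetical names and values, chosen so that exactly one variant matches), not the asker's real data:

```r
# Hypothetical example data, shaped like dat and dat2 in the question
ranges <- data.frame(
  Locus = c("ACTC1-001_1", "ACTC1-001_2"),
  Pos   = c("chr15:35087734..35087734", "chr15:35086890..35086919"),
  stringsAsFactors = FALSE
)
variants <- data.frame(
  VAR     = c("chr15:35086900", "chr1:116242855"),
  REF.ALT = c("T/A", "A/G"),
  stringsAsFactors = FALSE
)

# Split "chr:start..end" into chromosome, start and end columns
ranges$chr   <- sub(":.*$", "", ranges$Pos)
ranges$start <- as.numeric(sub("^.*:([0-9]+)\\.\\..*$", "\\1", ranges$Pos))
ranges$end   <- as.numeric(sub("^.*\\.\\.", "", ranges$Pos))

# Split "chr:position" into chromosome and position columns
variants$chr <- sub(":.*$", "", variants$VAR)
variants$pos <- as.numeric(sub("^.*:", "", variants$VAR))

# Join on chromosome, then keep rows whose position lies in the inclusive range
joined <- merge(ranges, variants, by = "chr")
range.merged <- joined[joined$pos >= joined$start & joined$pos <= joined$end, ]
range.merged
```

The intermediate join still pairs every range with every variant on the same chromosome, so this is a sketch for moderate data sizes rather than a replacement for a dedicated interval join.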

Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLIT THE DATA
# split "chr:start..end" in dat$Pos into chromosome, range minimum and range maximum
pos.parts <- do.call(rbind, strsplit(dat$Pos, ":"))
range.strings <- do.call(rbind, strsplit(pos.parts[, 2], "..", fixed = TRUE))
mins <- as.numeric(range.strings[, 1])
maxs <- as.numeric(range.strings[, 2])
# split "chr:position" in dat2$VAR the same way
var.parts <- do.call(rbind, strsplit(dat2$VAR, ":"))
d2.vars <- as.numeric(var.parts[, 2])
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# a match needs the same chromosome and a position inside the inclusive range
matches <- outer(seq_along(mins), seq_along(d2.vars), function(i, j)
  pos.parts[i, 1] == var.parts[j, 1] & mins[i] <= d2.vars[j] & d2.vars[j] <= maxs[i])
MERGE
# one output row per matching (row of dat, row of dat2) pair
idx <- which(matches, arr.ind = TRUE)
cbind(dat[idx[, 1], ], dat2[idx[, 2], ])

Related

Looping through values to make changes in data frame

I have a code that makes some changes in a dataframe.
value <- iris[1:120,]
cngfunc <- function(day, howmany, howmuch) {
  shuffled <- day[sample(1:nrow(day)), ]
  n <- as.integer((howmany / 100) * nrow(day)) # select percentage of data to be changed
  extracted <- shuffled[1:n, ]
  extracted$changed <- extracted[, 1] * ((howmuch / 100) + 1) # how much the data changes
  extracted
}
cngfunc(value,10,20)
Now I want to loop through the values of howmany and howmuch.
For example, howmuch <- c(10,20,30,40,50) and howmany <- c(10,20,30,40,50)
So the first results would be cngfunc(value,10,10), cngfunc(value,10,20), cngfunc(value,10,30), ... then cngfunc(value,20,10), cngfunc(value,20,20), and so on, such that I'll have 25 different data frames.
Is there a way to do that?
You can do it with expand.grid to get all of the combinations, and then map2 to create a list of data frames:
library(tidyverse)
combos <- expand.grid(c(10,20,30,40,50), c(10,20,30,40,50))
result <- map2(combos$Var1, combos$Var2, function(x, y) cngfunc(value, x, y)) %>%
setNames(tidyr::unite(combos, Var, Var1:Var2, sep = "-")$Var)
Not sure where you are getting 120 dataframes from, as 5 * 5 = 25. This should be the general idea though.
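The same grid can also be walked in base R with `Map`. A minimal sketch, assuming the `cngfunc` from the question (repeated here so the block runs on its own) and hypothetical result names of the form `howmany-howmuch`:

```r
# cngfunc as defined in the question, repeated so this block is self-contained
cngfunc <- function(day, howmany, howmuch) {
  shuffled <- day[sample(seq_len(nrow(day))), ]
  n <- as.integer((howmany / 100) * nrow(day))  # number of rows to change
  extracted <- shuffled[seq_len(n), ]
  extracted$changed <- extracted[, 1] * ((howmuch / 100) + 1)
  extracted
}

value <- iris[1:120, ]

# All 25 combinations of the two parameters
combos <- expand.grid(howmany = c(10, 20, 30, 40, 50),
                      howmuch = c(10, 20, 30, 40, 50))

# Map() calls cngfunc once per row of combos and returns a list
results <- Map(function(hm, hw) cngfunc(value, hm, hw),
               combos$howmany, combos$howmuch)
names(results) <- paste(combos$howmany, combos$howmuch, sep = "-")

length(results)            # one data frame per combination
nrow(results[["10-20"]])   # rows sampled when howmany = 10
```

Because `cngfunc` samples rows at random, calling `set.seed()` beforehand is needed if the results have to be reproducible.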

Iteratively adding a row containing characters and numbers to a dataframe

I have a list containing named elements. I am iterating over the list names, performing the computation for each corresponding element, "encapsulating" the results and the name in a vector and finally adding the vector to a table. The row or vector after each iteration contains a mix of characters and numbers.
The first row is getting added but from the second row onwards there is a problem.
In this example, there is supposed to be one column (first) containing alphanumeric names. All rows after the first one contain NAs.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s))
}
df <- as.data.frame(df)
I know there are possibly more efficient ways but for the moment this is more intuitive for me as it is assuring that each computation is associated with a particular name. There can be several columns and rows and the names are extremely helpful to join tables, query, compare etc. They make it easier to trace back results to a particular element in my original list.
Additionally, I would be glad to know other ways in which the element names are always retained while transforming.
Thank you!
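You have to set stringsAsFactors = FALSE in rbind. With stringsAsFactors = TRUE (the default before R 4.0.0), the first iteration of the loop converts the string variables into factors (with the factor levels being the values), so values added in later iterations that are not among those levels become NA.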
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for (name in names(x)) {
  tmp <- x[[name]]
  m <- mean(tmp)
  s <- sum(tmp)
  df <- rbind(df, c(name, m, s), stringsAsFactors = FALSE)
}
An easier solution would be to utilize sapply().
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame(name = names(x), m = sapply(x, mean), s = sapply(x, sum))
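If you prefer to keep an explicit loop, a common pattern is to build one-row data frames and bind them once at the end, so the numbers are never coerced to character by `c(name, m, s)`. A minimal sketch under that assumption (the `rows` list is a name introduced here for illustration):

```r
x <- list(a_1 = c(1, 2, 3), b_2 = c(3, 4, 5), c_3 = c(5, 1, 9))

# Collect one single-row data frame per list element, keyed by element name
rows <- vector("list", length(x))
names(rows) <- names(x)
for (name in names(x)) {
  tmp <- x[[name]]
  rows[[name]] <- data.frame(name = name, m = mean(tmp), s = sum(tmp),
                             stringsAsFactors = FALSE)
}

# Bind all rows in one step; m and s stay numeric
df <- do.call(rbind, rows)
df
```

Binding once at the end also avoids growing the data frame on every iteration.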

Making Calculations on Several Textfiles and making a Dataframe from it R

I am trying to create a table from calculations that I am doing on several text files. I think this might require a loop of some sort, but I am stuck on how to proceed. I have tried different loops but none seem to be working. I have managed to do what I want with one file. Here is my working code:
flare <- read.table("C:/temp/HD3_Bld_CD8_TEM.txt",
header=T)
head(flare[,c(1,2)])
#sum of the freq column, check to see if close to 1
sum(flare$freq)
#Sum of top 10
ten <- sum(flare$freq[1:10])
#Sum of 11-100
to100 <- sum(flare$freq[11:100])
#Sum of 101-1000
to1000 <- sum(flare$freq[101:1000])
#sum of 1001+
rest <- sum(flare$freq[-c(1:1000)])
#place the values of the sum in a table
df <- data.frame(matrix(ncol = 1, nrow = 4))
x <- c("Sum")
colnames(df) <- x
y <- c("10", "11-100", "101-1000", "1000+")
row.names(df) <- y
df[,1] <- c(ten,to100,to1000,rest)
The dataframe ends up looking like this:
> View(df)
               Sum
10       0.1745092
11-100   0.2926735
101-1000 0.4211533
1000+    0.1116640
This is perfect for making a stacked barplot, which I did. However, this is only for one text file. I have several files of the same kind. All of them have the same column names, so all of them will use the freq column for the calculations. How do I make a table after doing the calculations on each file? I want to keep the names of the text files as the sample names, so that when I make a joint stacked barplot all the names will be there. Also, what is the best way to orient the data when writing the new table/dataframe?
I am still new to R, so any help, any explanation would be most welcome. Thank you.
How about something like this? Your example is not reproducible, so I made a dummy example which you can adjust:
library(tidyverse)
###load ALL your dataframes
test_df_1 <- data.frame(var1 = matrix(c(1,2,3,4,5,6), nrow = 6, ncol = 1))
test_df_1
test_df_2 <- data.frame(var2 = matrix(c(7,8,9,10,11,12), nrow = 6, ncol = 1))
test_df_2
### Bind them into one big wide dataframe
df <- cbind(test_df_1, test_df_2)
### Add an id column which repeats (in your case adjust this to repeat for the grouping you want, i.e replace the each = 2 with each = 10, and each = 4 with each = 100)
df <- df %>%
  mutate(id = paste0("id_", c(rep(1, each = 2), rep(2, each = 4))))
### Gather your dataframes into long format by the id
df_gathered <- df %>%
  gather(value = value, key = key, -id)
df_gathered
### use group_by to group data by id and summarise to get the sum of each group
df_gathered_sum <- df_gathered %>%
  group_by(id, key) %>%
  summarise(sigma = sum(value))
df_gathered_sum
You might have some issues with the id column if your data frames are not of equal length, so this is only a partial answer. I could do better with a shortened example of your dataset. Can anyone else weigh in on creating an id column? I may have sorted it out with a couple of edits.
I think I solved it! It gives me the dataframe I want, and from it, I can make the stacked barplot to display the data.
sumfunction <- function(x) {
  wow <- read.table(x, header = TRUE)
  # Sum of top 10
  ten <- sum(wow$freq[1:10])
  # Sum of 11-100
  to100 <- sum(wow$freq[11:100])
  # Sum of 101-1000
  to1000 <- sum(wow$freq[101:1000])
  # Sum of 1001+
  rest <- sum(wow$freq[-c(1:1000)])
  # return the four sums visibly
  c(ten, to100, to1000, rest)
}
library(data.table)
library(tools)
dir <- "C:/temp/"
# "\\.txt$" is a regular expression (pattern is not a glob);
# full.names = TRUE lets read.table() find the files from any working directory
filenames <- list.files(path = dir, pattern = "\\.txt$", full.names = TRUE)
alltogether <- lapply(filenames, sumfunction)
data <- as.data.frame(data.table::transpose(alltogether),
  col.names = c("Top 10", "From 11 to 100", "From 101 to 1000", "From 1000 on"),
  row.names = file_path_sans_ext(basename(filenames)))
This gives me the dataframe that I want. Instead of putting "top 10, 11-100, 101-1000, 1000+" as the row names, I changed them to column names and made the names of the text files the row names. The file_path_sans_ext(basename(filenames)) call keeps just the file name and removes the extension.
I hope this helps anyone that reads this! thank you again! I love this platform because just being part of this environment gets me thinking and always striving to better myself at R.
If anyone has any input, that would be great!!! <3
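A variant of the same idea can be sketched with `cut` and `vapply`, which returns a matrix directly and checks that every file yields exactly four sums. This is only an illustration: the temporary files, the `summarise_freq` name, and the simulated `freq` values are assumptions, not the real data.

```r
# Hypothetical helper: sum freq over rank bands 1-10, 11-100, 101-1000, 1001+
summarise_freq <- function(path) {
  freq <- read.table(path, header = TRUE)$freq
  band <- cut(seq_along(freq), breaks = c(0, 10, 100, 1000, Inf),
              labels = c("Top 10", "11-100", "101-1000", "1001+"))
  # split() keeps empty bands, and sum() of an empty band is 0
  vapply(split(freq, band), sum, numeric(1))
}

# Write two small simulated files to a temporary folder
tmp.dir <- file.path(tempdir(), "freq_demo")
dir.create(tmp.dir, showWarnings = FALSE)
for (sample.name in c("sampleA", "sampleB")) {
  write.table(data.frame(freq = rep(1 / 1500, 1500)),
              file.path(tmp.dir, paste0(sample.name, ".txt")),
              row.names = FALSE)
}

files <- list.files(tmp.dir, pattern = "\\.txt$", full.names = TRUE)
names(files) <- tools::file_path_sans_ext(basename(files))

# One column per file, one row per band
band.sums <- vapply(files, summarise_freq, numeric(4))
band.sums
```

With samples as columns and bands as rows, the matrix can be passed straight to `barplot()` for a stacked bar per sample; use `t(band.sums)` if you prefer samples as rows.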

How to read and use the dataframes with the different names in a loop?

I'm struggling with the following issue: I have many data frames with different names (For instance, Beverage, Construction, Electronic etc., dim. 540x1000). I need to clean each of them, calculate and save as zoo object and R data file. Cleaning is the same for all of them - deleting the empty columns and the columns with some specific names.
For example:
Beverages <- Beverages[,colSums(is.na(Beverages))<nrow(Beverages)] #removing empty columns
Beverages_OK <- Beverages %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
Beverages_OK[, 1] <- NULL #dropping the first column
Beverages_OK <- cbind(data[1], Beverages_OK) # adding a date column
Beverages_zoo <- read.zoo(Beverages_OK, header = FALSE, format = "%Y-%m-%d")
save (Beverages_OK, file = "StatisticsInRFormat/Beverages.RData")
I tried to use the 'lapply' function like this:
list <- ls() # the list of all the dataframes
lapply(list, function(X) {
temp <- X
temp <- temp [,colSums(is.na(temp))< nrow(temp)] #removing empty columns
temp <- temp %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
temp[, 1] <- NULL
temp <- cbind(data[1], temp)
X_zoo <- read.zoo(X, header = FALSE, format = "%Y-%m-%d") # I don't know how to have the same name as X has.
save (X, file = "StatisticsInRFormat/X.RData")
})
but it doesn't work. Is there any way to do such a job? Is there any R package that facilitates it?
Thanks a lot.
If you are sure that you have only the needed data frames in the environment, this should get you started. Note that ls() returns the object names as strings, so get() is needed to fetch each data frame:
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
list <- ls()
lapply(list, function(x) {
tmp <- get(x)
})
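Building on `get`, one way to finish the job is to collect the data frames into a named list with `mget`, clean each one, and save it under its own name. A minimal sketch, assuming hypothetical stand-in objects `Beverages` and `Construction`, a temporary output folder, and only the two column-dropping steps from the question (the date column and the zoo conversion are left out):

```r
# Hypothetical stand-ins for the real data frames
Beverages    <- data.frame(id = 1:3, a = c(1, 2, 3), empty = NA, X.ERROR.1 = "e")
Construction <- data.frame(id = 1:3, b = c(4, 5, 6), X.ERROR.2 = "e")

# Name the data frames explicitly instead of relying on ls()
df.names <- c("Beverages", "Construction")
frames <- mget(df.names)

clean_frame <- function(df) {
  df <- df[, colSums(is.na(df)) < nrow(df), drop = FALSE]  # drop all-NA columns
  df[, !startsWith(names(df), "X.ERROR"), drop = FALSE]     # drop X.ERROR columns
}
cleaned <- lapply(frames, clean_frame)

# Save each cleaned data frame to <name>.RData under its original object name
out.dir <- file.path(tempdir(), "StatisticsInRFormat")
dir.create(out.dir, showWarnings = FALSE)
for (nm in names(cleaned)) {
  assign(nm, cleaned[[nm]])
  save(list = nm, file = file.path(out.dir, paste0(nm, ".RData")))
}
```

Keeping everything in a named list means the name travels with the data, so it is available for file names and for any later per-object step.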

Most efficient way of avoiding loops to create data.frame

I have a data.frame which includes the runs scored in each innings of baseball games as a character vector.
I want to create a new data.frame which lists the number of runs in each innings for each game. I can do this with a loop but appreciate that this is too slow for any reasonable number of observations and that the rbind method shown is also not ideal.
The number of innings may vary, and an x indicates that the team did not need to bat in the 9th inning because the game was already won.
library(stringr)
data <- data.frame(gameID=c("a","b","c"),innings=c("002100000","30000000x","10101010101"))
for (i in 1:nrow(data)) {
  box <- as.integer(str_split(data$innings[i], "")[[1]])
  tempdf <- data.frame(box, id = data$gameID[i])
  if (i != 1) {
    df <- rbind(df, tempdf)
  } else {
    df <- tempdf
  }
}
This helps a bit (30%):
res <- vector("list", nrow(data))
for(i in seq_along(res))
res[[i]] <- data.frame(box=as.integer(str_split(data$innings[i], "")[[1]]),
id=data$gameID[i])
do.call(rbind, res)
Not sure if this is faster,
library(splitstackshape)
data$innings <- gsub('', ' ', data$innings)
cSplit(data, 'innings', ' ', 'long')
Here's a way using lists with lapply:
library(dplyr) # for bind_rows -- you can also use do.call(rbind, list)
innings <- str_split(data$innings, "")
names(innings) <- data$gameID
innings <- lapply(innings, function(x) data.frame(box = x))
bind_rows(innings, .id = "id")
This should be pretty fast:
## Defined these separately just for readability
innings <- as.character(data$innings) # or use 'stringsAsFactors=FALSE' when defining the data frame
box <- unlist(strsplit(innings, ""))
id <- rep(data$gameID, nchar(innings))
## To get a character matrix back
cbind(box, id)
## To get a data frame back
data.frame(box=box, id=id, stringsAsFactors=FALSE)
Using a matrix is faster, but if you want to have mixed classes use a data frame. Also, for a data frame, it's faster to use characters than factors (thus the stringsAsFactors=FALSE argument). If you want box to be numeric, you can wrap it in as.integer (but then the matrix option won't work, of course, because cbind would coerce everything back to character).
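Put together as a self-contained check, assuming the example `data` from the question and treating the `x` placeholder as `NA` (the `inning` column and the `long` name are additions for illustration):

```r
data <- data.frame(gameID = c("a", "b", "c"),
                   innings = c("002100000", "30000000x", "10101010101"),
                   stringsAsFactors = FALSE)

# One character per inning, flattened across all games
box.chr <- unlist(strsplit(data$innings, ""))

# "x" (team did not bat) cannot be parsed as an integer, so map it to NA first
box.chr[box.chr == "x"] <- NA
box <- as.integer(box.chr)

# Repeat each game ID once per inning and number the innings within each game
n.innings <- nchar(data$innings)
long <- data.frame(id = rep(data$gameID, n.innings),
                   inning = sequence(n.innings),
                   box = box,
                   stringsAsFactors = FALSE)
long
```

Everything here is vectorised over the whole column, so no per-game data frames are created and nothing is bound row by row.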
