I want to create a lot of variables across several separate data frames, which I will then combine into one grand data frame.
Each sheet is labeled with a letter (there are 24), and each sheet contributes somewhere between 100 and 200 variables. I could write it as such:
a$variable1 <- NA
a$variable2 <- NA
.
.
.
w$variable25 <- NA
This can/will get ugly, and I'd like to write a loop or use a vector to do the work. I'm having a heck of a time doing it though.
I essentially need a script which will allow me to specify a form and then just tack numbers onto it.
So,
a$variable[i] <- NA
where [i] gets tacked onto the actual variable name that is created.
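In other words, something like this sketch, where [[ with a pasted name stands in for the $ form, since $ cannot take a constructed name (the starting frame a here is just a hypothetical stand-in):
a <- data.frame(id = 1:5)  # hypothetical starting data frame
for (i in 1:200) {
  # build the name "variable<i>" and create that column, filled with NA
  a[[paste0("variable", i)]] <- NA
}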
I just learnt this neat little trick from #eddi
#created some random dataset with 3 columns
library(data.table)
a <- data.table(
a1 = c(1,5),
a2 = c(2,1),
a3 = c(3,4)
)
# assuming that you now need to add more columns, a4 through a200
# first, create the sequence from 4 to 200
v <- 4:200
# then use that sequence to add the 197 additional columns, all set to NA
a[, paste0("a", v) := NA]
# now a has 200 columns, as compared to the three we initiated it with
dim(a)
#[1] 2 200
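Since the eventual goal is one grand table, the same data.table idiom extends naturally to combining. A rough sketch, assuming the 24 lettered tables are collected in a named list (only the a from above is shown here):
sheets <- list(a = a)  # hypothetical list; add the other lettered tables to it
# pad each table with the extra NA columns it needs, by reference
lapply(sheets, function(dt) dt[, paste0("extra", 1:3) := NA])
# stack everything into one grand table; fill = TRUE tolerates differing columns
grand <- rbindlist(sheets, fill = TRUE, idcol = "sheet")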
I don't think you actually need this, although you seem to think so for some reason.
Maybe something like this:
a <- as.data.frame(matrix(NA, ncol=10, nrow=5))
names(a) <- paste0("Variable", 1:10)
print(a)
# Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10
# 1 NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA
If you want variables with different types:
p <- 10 # number of variables
N <- 100 # number of records
vn <- vector(mode="list", length=p)
names(vn) <- paste0("V", seq(p))
vn[1:8] <- NA_real_ # numeric
vn[9:10] <- NA_character_ # character
df <- as.data.frame(lapply(vn, function(x, n) rep(x, n), n=N))
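A quick way to confirm that the column types came out as intended:
sapply(df, class)
# V1 through V8 should report "numeric", V9 and V10 "character"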
I am currently trying to find unique elements between two columns of a data frame and write these to a new final data frame.
This is my code, which works perfectly fine, and creates a result which matches my expectation.
set.seed(42)
df <- data.frame(a = sample(1:15, 10),
                 b = sample(1:15, 10))
unique_to_a <- df$a[!(df$a %in% df$b)]
unique_to_b <- df$b[!(df$b %in% df$a)]
n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A=rep(NA,n), B=rep(NA,n))
for (element in unique_to_a) {
  out[element, "A"] <- element
}
for (element in unique_to_b) {
  out[element, "B"] <- element
}
out
The problem is that it is very slow, because the real data contains hundreds of thousands of rows. I am quite sure it is because of the repeated indexing I am doing in the for loop, and I am sure there is a quicker, vectorized way, but I don't see it...
Any ideas on how to speed up the operation is much appreciated.
Cheers!
Didn't compare the speed but at least this is more concise:
elements <- with(df, list(setdiff(a, b), setdiff(b, a)))
data.frame(sapply(elements, \(x) replace(rep(NA, max(unlist(elements))), x, x)))
# X1 X2
# 1 NA NA
# 2 NA NA
# 3 NA 3
# 4 NA NA
# 5 NA NA
# 6 NA NA
# 7 NA NA
# 8 NA NA
# 9 NA NA
# 10 NA NA
# 11 11 NA
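For speed specifically, the repeated row-by-row indexing in the original loop can also be replaced by one vectorized assignment per column; a sketch reusing the elements list from above:
n <- max(unlist(elements))
out <- data.frame(A = rep(NA_integer_, n), B = rep(NA_integer_, n))
# assign each unique value directly to the row position it names
out$A[elements[[1]]] <- elements[[1]]
out$B[elements[[2]]] <- elements[[2]]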
I am rather new to R, so I would be grateful if anyone could help me :)
I have a large matrix, for example:
[matrix shown as an image in the original post]
and a vector of genes.
My task is to search the matrix row by row and compile pairs of the genes carrying a mutation (in the matrix this is D707H) with the rest of the genes contained in the vector, and add them to a new matrix. I tried to do this with loops but I have no idea how to write it correctly. For this matrix it should look something like this:
PR.02.1431
NBN BRCA1
NBN BRCA2
NBN CHEK2
NBN ELAC2
NBN MSR1
NBN PARP1
NBN RNASEL
Now I have something like this (my attempt was posted as an image):
"a" is my initial matrix.
Can anyone point me in the right direction? :)
Perhaps what you want/need is which(..., arr.ind = TRUE).
Some sample data, for demonstration:
set.seed(2)
n <- 10
mtx <- array(NA, dim = c(n, n))
dimnames(mtx) <- list(letters[1:n], LETTERS[1:n])
mtx[sample(n*n, size = 4)] <- paste0("x", 1:4)
mtx
# A B C D E F G H I J
# a NA NA NA NA NA NA NA NA NA NA
# b NA NA NA NA NA NA NA NA NA NA
# c NA NA NA NA NA NA NA NA NA NA
# d NA NA NA NA NA NA NA NA NA NA
# e NA NA NA NA NA NA NA NA NA NA
# f NA NA NA NA NA NA NA NA NA NA
# g NA "x4" NA NA NA "x3" NA NA NA NA
# h NA NA NA NA NA NA NA NA NA NA
# i NA "x1" NA NA NA NA NA NA NA NA
# j NA NA NA NA NA NA "x2" NA NA NA
In your case, it appears that you want anything that is not an NA or NaN. You might try:
which(! is.na(mtx) & ! is.nan(mtx))
# [1] 17 19 57 70
but that isn't always intuitive when retrieving the row/column pairs (genes, I think?). Try instead:
ind <- which(! is.na(mtx) & ! is.nan(mtx), arr.ind = TRUE)
ind
# row col
# g 7 2
# i 9 2
# g 7 6
# j 10 7
How to use this: the integers are row and column indices, respectively. Assuming your matrix is using row names and column names, you can retrieve the row names with:
rownames(mtx)[ ind[,"row"] ]
# [1] "g" "i" "g" "j"
(An astute reader might suggest I use rownames(ind) instead. It certainly works!) Similarly for the colnames and "col".
Interestingly enough, even though ind is a matrix itself, you can subset mtx fairly easily with:
mtx[ind]
# [1] "x4" "x1" "x3" "x2"
Combining all three together, you might be able to use:
data.frame(
gene1 = rownames(mtx)[ ind[,"row"] ],
gene2 = colnames(mtx)[ ind[,"col"] ],
val = mtx[ind]
)
# gene1 gene2 val
# 1 g B x4
# 2 i B x1
# 3 g F x3
# 4 j G x2
I know where my mistake was; now I have a matrix. Analyzing your code, it works well, but that's not exactly what I want to do.
a, b, c, d, etc. are organisms and the row names are genes (A, B, C, D, etc.). I have to combine pairs of genes where one of them (in the same column) has something other than an NA value. For example, if gene A has value = 4 in column a, I need:
gene1 gene2
a A B
a A C
a A D
a A E
I tried it this way, but the numbers of elements do not match and I do not know how to solve this.
ind <- which(!is.na(a) & !is.nan(a), arr.ind = TRUE)
ind1 <- which(macierz == 1, arr.ind = TRUE)
ramka <- data.frame(
  kolumna = rownames(a)[ind[, "row"]],
  gene1 = colnames(a)[ind[, "col"]],
  gene2 = colnames(a)[ind1[, "col"]]
  # val = macierz[ind]
)
Do you know how to do this in R?
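If I follow the follow-up correctly, one way is to expand every non-NA hit into a pair with each of the other genes. A sketch, assuming a has genes as row names and organisms as column names (matching the example output above):
ind <- which(!is.na(a) & !is.nan(a), arr.ind = TRUE)
pairs <- do.call(rbind, lapply(seq_len(nrow(ind)), function(k) {
  g <- rownames(a)[ind[k, "row"]]  # the gene that has a non-NA value
  data.frame(
    organism = colnames(a)[ind[k, "col"]],
    gene1 = g,
    gene2 = setdiff(rownames(a), g)  # pair it with every other gene
  )
}))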
I have a set of data and a loop containing numerous calculations for the data set, where the individual components of the set are split into a subset and cycled through one by one. However, I need to be able to execute the same calculations across the original data set as a whole first.
For a fictional data set called masterdata with 3 components (column D1) and numerous variables (X2-X10), as follows:
# masterdata
# D1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# A NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# C NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# C NA NA NA NA NA NA NA NA NA
# C NA NA NA NA NA NA NA NA NA
# A NA NA NA NA NA NA NA NA NA
# B NA NA NA NA NA NA NA NA NA
# A NA NA NA NA NA NA NA NA NA
A loop is in place to split off a subset for component A, perform the calculations, output the results and then repeat this for B and C:
Component.List = c("A", "B", "C")
for(k in 1:length(Component.List)) {
subdata = subset(masterdata, D1 == Component.List[k])
# Numerous calculations performed on "subdata" within the loop
}
# End of loop
What I am trying to do is initially perform the same numerous calculations against the whole of masterdata and then start looping through the individual components.
Part of the output is that two vectors created during the calculations are placed into columns of two data frames that are set up just prior to executing the loop:
# Prior to the start of the loop two frames below created
Components = 3 # In this example 3 components in column D1 - "A", "B", "C"
Result.Frame.V1 = as.data.frame(matrix(0, nrow = 200, ncol = Components))
Result.Frame.V2 = as.data.frame(matrix(0, nrow = 200, ncol = Components))
# The loop runs all of the calculations; within them, the last two lines
# below place the two generated vectors into the kth columns of the frames.
Result.Frame.V1[,k] = V1.Result
Result.Frame.V2[,k] = V2.Result
# First run of the loop for "A" will place the outputs in the 1st columns
# Second run of the loop for "B" will place the outputs in the 2nd columns, etc.
# With the expansion to also calculate against the whole group, the above data frames
# would be expanded to an extra column that would hold the result vector for the whole
# masterdata run through the calculations
My initial theoretical solution is to write out every calculation once for masterdata and then run the above loop; however, the calculations are hundreds of lines of code!
Is it possible to have the for loop first run the calculations on the original data and then continue cycling through the components?
It seems like dplyr would solve this elegantly, among the other options.
For the whole data:
library(dplyr)
masterdata %>%
summarise(result = your_function(arg1 = X1, arg2 = X2, ...))
For each component, just add group_by():
masterdata %>%
group_by(D1) %>%
summarise(result = your_function(arg1 = X1, arg2 = X2, ...))
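To make that concrete with a real function in place of your_function (here simply the mean of X2, assuming X2 is numeric):
library(dplyr)
# whole data: one summary row
masterdata %>% summarise(result = mean(X2, na.rm = TRUE))
# per component: one summary row per level of D1
masterdata %>% group_by(D1) %>% summarise(result = mean(X2, na.rm = TRUE))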
If you are outputting data frames, then the key is to create a function that performs your calculations when passed a data frame and returns a data frame. In the example below that function is called your_function().
For simplicity, a three-stage process is used: first create the output data frame for the overall dataset, then use lapply to perform the same calculations on the sub-datasets. The sub-datasets are then bound together into a single data frame before finally being combined with the output of the full dataset.
Note: I created a new variable called "Subset" so that the outputs are all identifiable as belonging to each distinct set.
library(dplyr)
FullSet <- your_function(masterdata) %>% mutate(Subset = "Full")
SubSets <- lapply(unique(masterdata$D1), function(n) {
masterdata %>% filter(D1 == n) %>%
your_function(.) %>% mutate(Subset = n)
}) %>% bind_rows()
FinalSet <- bind_rows(FullSet, SubSets)
If you want to run the process in parallel for speed, use mclapply() from the parallel package instead:
mclapply(unique(masterdata$D1), function..., mc.cores = detectCores())
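A sketch of that parallel variant in full (note that mclapply forks, so mc.cores > 1 works on Unix-alikes but not on Windows):
library(parallel)
library(dplyr)
SubSets <- mclapply(unique(masterdata$D1), function(n) {
  masterdata %>% filter(D1 == n) %>%
    your_function() %>% mutate(Subset = n)
}, mc.cores = detectCores()) %>% bind_rows()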
I feel like this is a relatively straightforward question, and I feel I'm close but I'm not passing edge-case testing. I have a directory of CSVs and instead of reading all of them, I only want some of them. The files are in a format like 001.csv, 002.csv,...,099.csv, 100.csv, 101.csv, etc which should help to explain my if() logic in the loop. For example, to get all files, I'd do something like:
id = 1:1000
setwd("D:/")
filenames = as.character(NULL)
for (i in id) {
  if (i < 10) {
    i <- paste("00", i, sep = "")
  } else if (i < 100) {
    i <- paste("0", i, sep = "")
  }
  filenames[[i]] <- paste(i, ".csv", sep = "")
}
y <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
The above code works fine for id=1:1000, for id=1:10, id=20:70 but as soon as I pass it id=99:100 or any sequence involving numbers starting at over 100, it introduces a lot of NAs.
Example output below for id=98:99
> filenames
098 099
"098.csv" "099.csv"
Example output below for id=99:100
> filenames
      099
"099.csv"        NA        NA        NA        NA        NA        NA
 ... (positions 2 through 99 are all NA) ...
"100.csv"
I feel like I'm missing some catch statement in my if() logic. Any insight would be greatly appreciated! :)
You can avoid the loop for creating the filenames:
filenames <- sprintf('%03d.csv', 1:1000)
y <- do.call(rbind, lapply(filenames, read.csv, header = TRUE))
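If some of the numbered files might be absent on disk, a small guard keeps read.csv from failing; a sketch (paths assumed relative to the working directory):
filenames <- sprintf('%03d.csv', 1:1000)
filenames <- filenames[file.exists(filenames)]  # drop ids with no file
y <- do.call(rbind, lapply(filenames, read.csv, header = TRUE))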
#akrun has given you a much better way of solving your task. But in terms of the actual issue with your code, the problem is that for i < 100 you subset by a character vector (implicitly converted using paste) while for i >= 100 you subset by an integer. When you use id = 99:100 this translates to:
filenames <- character(0)
filenames["099"] <- "099.csv" # length(filenames) == 1L
filenames[100] <- "100.csv" # length(filenames) == 100L; filenames[2:99] are all NA
Assigning to a named member of a vector that doesn't yet exist will create a new member at position length(vector) + 1 whereas assigning to a numbered position that is > length(vector) will also fill in every intervening position with NA.
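The effect is easy to reproduce in isolation:
x <- character(0)
x["099"] <- "099.csv"  # named assignment appends a single element
length(x)              # 1
x[100] <- "100.csv"    # numeric assignment pads positions 2..99 with NA
length(x)              # 100
sum(is.na(x))          # 98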
Another approach, although less efficient than #akrun's solution, is with the following function:
merged <- function(id = 1:332) {
  df <- data.frame()
  for (i in seq_along(id)) {
    add <- read.csv(sprintf('%03d.csv', id[i]))
    df <- rbind(df, add)  # growing df with rbind is what makes this less efficient
  }
  df
}
Now, you can merge the files with:
dat <- merged(99:100)
Furthermore, you can assign column names by inserting the following line in the function, just before the final line with df:
colnames(df) <- c(..specify the colnames in here..)
I have this function which calculates the consonance score of a book. First I import the phonetics dictionary from CMU (which forms a data frame of about 134000 rows and 33 column variables; any row in the CMU dictionary is basically of the form CLOUDS K L AW1 D Z, where the first column has the words and the remaining columns have their phonetic equivalents). After getting the CMU dictionary, I parse a book into a vector containing all of its words; the longest book so far has 218711 words. Each word's phonetics are compared with the phonetics of the next word and of the word after that, and the TRUE match values are then combined into a sum. The function I have is this:
getConsonanceScore <- function(book, consonanceScore, CMUdict) {
  for (i in 1:(length(book) - 2)) {
    index1 <- replaceIfEmpty(which(toupper(book[i]) == CMUdict[, 1]))
    index2 <- replaceIfEmpty(which(toupper(book[i + 1]) == CMUdict[, 1]))
    index3 <- replaceIfEmpty(which(toupper(book[i + 2]) == CMUdict[, 1]))
    word1 <- as.character(CMUdict[index1, which(CMUdict[index1, ] != "")])
    word2 <- as.character(CMUdict[index2, which(CMUdict[index2, ] != "")])
    word3 <- as.character(CMUdict[index3, which(CMUdict[index3, ] != "")])
    consonanceScore <- sum(word1 %in% word2)
    consonanceScore <- consonanceScore + sum(word1 %in% word3)
    consonanceScore <- consonanceScore / length(book)
  }
  return(consonanceScore)
}
The replaceIfEmpty function basically just returns the index of a dummy value (declared in the last row of the data frame) if no match is found in the CMU dictionary for a word in the book. It goes like this:
replaceIfEmpty <- function(x) {
  if (length(x) > 0) {
    return(x)
  } else {
    x <- 133780
    return(x)
  }
}
The issue I am facing is that the getConsonanceScore function takes a lot of time, so much so that I had to divide the book length by 1000 in the loop just to check that the function was working at all. I am new to R and would really be grateful for some help in making this function more efficient and less time-consuming; are there any ways of doing this? (I will later have to call this function on possibly 50-100 books.) Thanks a lot!
I've recently re-read your question, the comments, and #wibeasley's answer, and realized that I hadn't understood everything correctly. Now it has become clearer, and I'll try to suggest something useful.
First of all, we need a small example to work with. I've made it from the dictionary in your link.
dictdf <- read.table(text =
"A AH0
CALLED K AO1 L D
DOG D AO1 G
DOGMA D AA1 G M AH0
HAVE HH AE1 V
I AY1",
header = F, col.names = paste0("V", 1:25), fill = T, stringsAsFactors = F )
#   V1     V2  V3  V4 V5 V6  ... V25
# 1 A      AH0 NA  NA NA NA  ... NA
# 2 CALLED K   AO1 L  D  NA  ... NA
# 3 DOG    D   AO1 G  NA NA  ... NA
# 4 DOGMA  D   AA1 G  M  AH0 ... NA
# 5 HAVE   HH  AE1 V  NA NA  ... NA
# 6 I      AY1 NA  NA NA NA  ... NA
# (columns V7 through V25, all NA, are elided here)
bookdf <- data.frame(words = c("I", "have", "a", "dog", "called", "Dogma"))
# words
# 1 I
# 2 have
# 3 a
# 4 dog
# 5 called
# 6 Dogma
Here we read the data from the dictionary with fill = T and manually define the number of columns in the data.frame by setting col.names. You may make it 50, 100, or some other number of columns (but I don't think there are words that long in the dictionary). And we make bookdf, a vector of words in the form of a data.frame.
Then let's merge the book and the dictionary together. I use the dplyr library mentioned by #wibeasley.
# for big data frames dplyr does merging fast
require("dplyr")
# make all letters uppercase
bookdf[,1] <- toupper(bookdf[,1])
# merge
bookphon <- left_join(bookdf, dictdf, by = c("words" = "V1"))
#   words  V2  V3  V4 V5 V6  ... V25
# 1 I      AY1 NA  NA NA NA  ... NA
# 2 HAVE   HH  AE1 V  NA NA  ... NA
# 3 A      AH0 NA  NA NA NA  ... NA
# 4 DOG    D   AO1 G  NA NA  ... NA
# 5 CALLED K   AO1 L  D  NA  ... NA
# 6 DOGMA  D   AA1 G  M  AH0 ... NA
# (columns V7 through V25, all NA, are elided here)
And after that we scan row-wise for matching sounds in consecutive words. I arranged it with the help of sapply.
consonanceScore <- sapply(1:(nrow(bookphon) - 2), function(i_row) {
  word1 <- bookphon[i_row, ][, -1]
  word2 <- bookphon[i_row + 1, ][, -1]
  word3 <- bookphon[i_row + 2, ][, -1]
  word1 <- unlist(word1[which(!is.na(word1) & word1 != "")])
  word2 <- unlist(word2[which(!is.na(word2) & word2 != "")])
  word3 <- unlist(word3[which(!is.na(word3) & word3 != "")])
  sum(word1 %in% word2) + sum(word1 %in% word3)
})
# [1] 0 0 0 4
There are no shared phonemes in the first three windows, but the 4th word, 'dog', has 2 matching sounds with 'called' (D and AO1) and 2 matches with 'dogma' (D and G). The result is a numeric vector; you can sum() it, divide it by nrow(bookdf), or whatever you need.
Are you sure it's working correctly? Isn't that function returning consonanceScore just for the last three words of the book? If the loop's 3rd-to-last-line is
consonanceScore <- sum(word1 %in% word2)
, how is its value being recorded, or influencing later iterations of the loop?
There are several vectorization approaches that will increase your speed, but for something tricky like this, I like making sure the slow loopy way is working correctly first. While you're in that stage of development, here are some suggestions how to make the code quicker and/or neater (which hopefully helps you debug with more clarity).
Short-term suggestions
Inside replaceIfEmpty(), use ifelse(). Maybe even use ifelse() directly inside the main function.
Why is as.character() necessary? That casting can be expensive. Are those columns factors? If so, use stringsAsFactors=F when you use something like read.csv().
Don't use toupper() three times for each iteration. Just convert the whole thing once before the loop starts.
Similarly, don't execute / length(book) for each iteration. Since it's the same denominator for the whole book, divide the final vector of numerators only once (after the loop's done).
Long-term suggestions
Eventually I think you'll want to look up each word only once, instead of three times. Those lookups are expensive. Similar to #inscaven's suggestion, I think an intermediate table makes sense (where each row is a word of the book).
To produce the intermediate table, you should get much better performance from a join function written and optimized by someone else in C/C++. Consider something like dplyr::left_join(). Maybe book has to be converted to a single-variable data.frame first. Then left join it to the first column of the dictionary. The row's subsequent columns will essentially be appended to the right side of book (which I think is what's happening now).
Once each iteration is quicker and correct, consider using one of the *apply functions, or something in dplyr. The advantage of these functions is that memory for the entire vector isn't destroyed and reallocated for every single word in each book.
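Pulling the short-term and long-term points together, a minimal sketch; it assumes the dictionary's dummy entry sits in its last row and that the phoneme columns were read as character (stringsAsFactors = FALSE):
getConsonanceScore <- function(book, CMUdict) {
  book <- toupper(book)                 # uppercase once, before any looping
  idx <- match(book, CMUdict[, 1])      # one vectorized lookup per word
  idx[is.na(idx)] <- nrow(CMUdict)      # unmatched words -> assumed dummy row
  # pre-extract each word's phonemes so no dictionary row is fetched twice
  phon <- lapply(idx, function(j) {
    p <- unlist(CMUdict[j, -1], use.names = FALSE)
    p[!is.na(p) & p != ""]
  })
  n <- length(book)
  scores <- vapply(1:(n - 2), function(i) {
    sum(phon[[i]] %in% phon[[i + 1]]) + sum(phon[[i]] %in% phon[[i + 2]])
  }, numeric(1))
  sum(scores) / n                       # divide once, after all the sums
}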