a<-as.factor(c('a','a','b','b','c','d'))
b<-as.factor(c('a','b','c','c','d','a'))
c<-as.factor(c('a','b','d','d','c','b'))
x<-data.frame(a,b,c)
a b c
1 a a a
2 a b b
3 b c d
4 b c d
5 c d c
6 d a b
I have a very large data table (using the data.table package) and I would like to simply
take the column names and prepend them to the factor values in each column for easy identification.
So in the simple example above (using a data frame for illustration) I would have something
like
a b c
a:a b:a c:a
a:a b:b c:b
a:b b:c c:d
..
..
a:d b:a c:b
I tried (unsuccessfully) to do some type of apply and paste combination,
but I can't quite pass the column names to paste correctly for each column.
Any ideas on how I could accomplish this task for large data tables? A data.table
approach would be great, but a data frame approach is fine as well, since it's only a one-time
action.
Data frame solution:
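x[] <- mapply(function(n, f) {
  # prefix every level of the factor with its column name
  levels(f) <- paste(n, levels(f), sep=":")
  f
}, names(x), x, SIMPLIFY = FALSE) # SIMPLIFY = FALSE keeps the columns as factors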
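For a quick check of the expected result, here is a minimal base R sketch that does the same relabelling with a plain loop over the column names. It rebuilds the example data frame from the question; nothing here is specific to data.table, and it has not been tuned for very large tables.

```r
# Rebuild the example data from the question
x <- data.frame(
  a = factor(c("a", "a", "b", "b", "c", "d")),
  b = factor(c("a", "b", "c", "c", "d", "a")),
  c = factor(c("a", "b", "d", "d", "c", "b"))
)

# Relabel the levels of each factor column with "<column name>:<level>"
for (n in names(x)) {
  levels(x[[n]]) <- paste(n, levels(x[[n]]), sep = ":")
}

print(x)
```

Only the levels are rewritten, so the cost depends on the number of distinct values per column rather than on the number of rows.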
I'm trying to combine these data, but I get "N/A" in the final output when I enter new names that don't exist in the first vector. How can I handle this so that all the data is shown without any "N/A"?
I think you just want to merge/append two factors. An easy approach would be to convert them to character, append them, and make the result a factor again.
Just a simple example with letters
p <- as.factor(LETTERS[3:8])
q <- as.factor(LETTERS[1:5])
as.factor(c(as.character(p), as.character(q)))
# [1] C D E F G H A B C D E
# Levels: A B C D E F G H
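One common cause of such missing values is assigning a value that is not among a factor's existing levels. The sketch below first reproduces that effect and then combines the two factors through character vectors; the `union()` of the levels is an optional addition (not part of the answer above) that keeps the levels in order of first appearance instead of sorting them.

```r
p <- factor(LETTERS[3:8])
q <- factor(LETTERS[1:5])

# Assigning a value that is not an existing level produces NA (and a warning)
p_bad <- p
suppressWarnings(p_bad[1] <- "A")
print(is.na(p_bad[1]))

# Going through character vectors avoids the problem
combined <- factor(
  c(as.character(p), as.character(q)),
  levels = union(levels(p), levels(q))
)
print(combined)
print(anyNA(combined))
```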
I apologize as I wasn't quite sure how to word my question without making it extremely lengthy, as the duplicate rows also need to have some altered values from the original.
I have two data frames. The first, df1, records all paths actually taken from source to destination, while the second, df2, contains all possible paths. Some sample data is below:
df1
Row Source Destination Payload
1   A      B           10010101
2   A      D           11101011
3   A      B           10111111
4   E      B           01100110
df2
Row Source Destination
1   A      B
2   B      A
3   B      C
4   B      E
5   B      F
6   A      D
7   D      A
8   D      C
9   D      H
For my data, it is assumed that if an object takes a path A -> B, for example, it also takes every possible path stemming from B that doesn't lead back to the original source (think of a networking hub: in one way, and out every other). So since we have a payload that goes from A -> B, I also need to record that same payload going from B to C, E, and F. I'm currently accomplishing this with the for loop below, but I would like to know if there is a better way to do it, preferably one that doesn't use looping. I'm also somewhat new to R, so even simple corrections to my code are appreciated.
for (row in 1:dim(df1)[1]){
  initialSource <- df1$source[row] #saves the initial source
  paths <- df1[row,] #saves the current row for duplication
  paths <- paths[rep(1, times = count(df2[df2$source %in% df1$destination[row], ])[[1]]), ] #duplicates the row
  paths$source <- paths$destination #replaces the source values to be the location of the hub
  paths$destination <- df2$destination[df2$source %in% paths$destination] #replaces the destination values to be every connection from the hub
  paths <- paths[!(paths$destination %in% initialSource), ] #removes the row that would indicate data being sent back to the source
  masterdf <- rbind(masterdf, paths) #saving the new data to a larger data frame that df1 is actually a sample of.
}
The data frame paths by the end of the first loop with the above data would look like:
Row Source Destination Payload
1   B      C           10010101
2   B      E           10010101
3   B      F           10010101
Maybe you could try merging your two dataframes. With base R merge you could do the following (using "Destination" from df1 and "Source" from df2). You would need to remove rows to exclude the "original source" as you described. Renaming and selecting the columns gives you the final output. Please let me know if this is what you had in mind.
d <- subset(
  merge(df1, df2, by.x = "Destination", by.y = "Source", all = TRUE),
  Source != Destination.y
)
data.frame(
  Source = d$Destination,
  Destination = d$Destination.y,
  Payload = d$Payload
)
Output
Source Destination Payload
1 B C 10010101
2 B E 10010101
3 B F 10010101
4 B C 10111111
5 B E 10111111
6 B F 10111111
7 B C 1100110
8 B F 1100110
9 B A 1100110
10 D C 11101011
11 D H 11101011
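The merge approach can be tried end to end with the sample data. This is only a sketch under a few assumptions of mine: the payloads are stored as character strings so that the leading zero of 01100110 survives, the columns of df2 are renamed to the hypothetical names `Hub` and `Next` before merging so the result does not depend on merge's suffix rules, and an inner join is used instead of `all = TRUE`.

```r
df1 <- data.frame(
  Source      = c("A", "A", "A", "E"),
  Destination = c("B", "D", "B", "B"),
  Payload     = c("10010101", "11101011", "10111111", "01100110"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Source      = c("A", "B", "B", "B", "B", "A", "D", "D", "D"),
  Destination = c("B", "A", "C", "E", "F", "D", "A", "C", "H"),
  stringsAsFactors = FALSE
)

# Rename df2's columns (hypothetical names) to avoid a name clash after merging
links <- setNames(df2, c("Hub", "Next"))

# Pair every recorded path with every link leaving its destination (the hub)
m <- merge(df1, links, by.x = "Destination", by.y = "Hub")

# Drop the links that would send the payload back to where it came from
m <- m[m$Next != m$Source, ]

out <- data.frame(
  Source      = m$Destination,
  Destination = m$Next,
  Payload     = m$Payload,
  stringsAsFactors = FALSE
)
print(out)
```

The row order follows merge's sorting of the key column, so it may differ from the order a row-by-row loop would produce.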
Three text files are in the same directory ("data001.txt", "data002.txt", "data003.txt"). I wrote a loop to read each data file and generate three data tables:
for(i in files) {
  x <- read.delim(i, header = F, sep = "\t", na = "*")
  setnames(x, 2, i)
  assign(i,x)
}
So let's say each individual table looks something like this:
var1 var2 var3
row1 2 1 3
I've used rbind to combine all of the tables...
combined <- do.call(rbind, mget(ls(pattern="^data")))
and get something like this:
var1 var2 var3
row1 2 1 3
var1 var2 var3
row1 3 2 4
var1 var2 var3
row1 1 3 5
leaving me with superfluous column names. At the moment I can get around this by just deleting that specific row containing the column names, but it's a bit clunky.
colnames(combined) = combined[1, ] # make the first row the column names
combined <- combined[-1, ] # delete the now-unnecessary first row
toKeep <- seq(1, nrow(combined), 2) # rows to keep: every odd row (the even rows are the repeated headers)
combined <- combined[toKeep, ] # drop the repeated header rows
This does give me what I want...
var1 var2 var3
row1 2 1 3
row1 3 2 4
row1 1 3 5
But I feel like a better way would simply be to extract the values of "row1" as a vector or as a list or whatever, and combine them all together into one data table. I feel like there is a quick and easy way to do this but I haven't been able to find anything yet. I've had a look here and here and here.
One possibility is to take the second row (that I want), and convert it into a matrix (then transpose it to make it a row instead of column!?) and rbind:
data001.txt <- as.matrix(data001.txt[2,])
data001.txt <- t(data001.txt)
combined <- rbind(data001.txt, data002.txt)
This gives me more or less what I want, except without the column name headers (e.g. var1, var2, var3).
v1 v2 v3
2 1 3
3 2 4
Any ideas? Would this second method work well if there is some way to add the column names? I feel like it's less clunky than the first method. Thanks for any input :)
edit - solved in answer below.
Figured it out. It requires converting to a matrix and using setnames from the data.table package. Say you have a range of text data files like the one that follows, and you want to extract just the seventh column (the one with the numbers, not letters), and combine them together in their own data table including the row names:
chemical1 a b c d e 1 g h i j k l m
chemical2 a b c d e 2 g h i j k l m
chemical3 a b c d e 3 g h i j k l m
chemical4 a b c d e 4 g h i j k l m
chemical5 a b c d e 5 g h i j k l m
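Read each file with row.names = 1 and header = F. Because the first column becomes the row names, the seventh column of the file is the sixth column of the resulting data frame.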
library(data.table) # for setnames()
setwd("directory")
files <- list.files(pattern = "data") # take all files with 'data' in their name
for(i in files) {
  x <- read.delim(i, row.names = 1, header = FALSE, sep = "\t", na.strings = "*")
  setnames(x, 6, i) # if the data you want is in column six. Sets data file name as the column name.
  x <- as.matrix(x[6]) # just take the sixth column with the numeric data (drop everything else)
  x <- t(x) # transpose (if you want..)
  assign(i, x)
}
combined <- do.call(rbind, mget(ls(pattern="^data"))) # combine the data matrices into one table
write.table(combined, file="filename.csv", sep=",", row.names=T, col.names = NA)
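The same result can be sketched in base R without assign() and mget(), by collecting one named vector per file in a list and binding the list at the end. The demo files below are made up and written to a temporary directory purely so the example can run on its own; the real files and their contents will differ.

```r
# Write three small tab-separated demo files (hypothetical contents)
demo_dir <- tempfile("demo")
dir.create(demo_dir)
for (k in 1:3) {
  values <- as.integer((1:5) * k)
  lines <- sprintf("chemical%d\ta\tb\tc\td\te\t%d\tg", 1:5, values)
  writeLines(lines, file.path(demo_dir, sprintf("data00%d.txt", k)))
}

files <- list.files(demo_dir, pattern = "^data", full.names = TRUE)

# One named numeric vector per file: the sixth column after the row names
rows <- lapply(files, function(f) {
  x <- read.delim(f, header = FALSE, row.names = 1, sep = "\t", na.strings = "*")
  setNames(x[[6]], rownames(x))
})

# Bind the vectors into a matrix with one row per file
combined <- do.call(rbind, rows)
rownames(combined) <- basename(files)
print(combined)
```

Keeping the pieces in a list avoids creating one global variable per file, and the column names come from the row names of the first file.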
I am trying to use a data.table within a function, and I am trying to understand why my code is failing. I have a data.table as follows:
DT <- data.table(my_name=c("A","B","C","D","E","F"),my_id=c(2,2,3,3,4,4))
> DT
my_name my_id
1: A 2
2: B 2
3: C 3
4: D 3
5: E 4
6: F 4
I am trying to create all pairs of "my_name" with different values of "my_id", which for DT would be:
Var1 Var2
A C
A D
A E
A F
B C
B D
B E
B F
C E
C F
D E
D F
I have a function to return all pairs of "my_name" for a given pair of values of "my_id" which works as expected.
get_pairs <- function(id1,id2,tdt) {
  return(expand.grid(tdt[my_id==id1,my_name],tdt[my_id==id2,my_name]))
}
> get_pairs(2,3,DT)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
Now, I want to execute this function for all pairs of ids, which I try to do by finding all pairs of ids and then using mapply with the get_pairs function.
> combn(unique(DT$my_id),2)
[,1] [,2] [,3]
[1,] 2 2 3
[2,] 3 4 4
tid1 <- combn(unique(DT$my_id),2)[1,]
tid2 <- combn(unique(DT$my_id),2)[2,]
mapply(get_pairs, tid1, tid2, DT)
Error in expand.grid(tdt[my_id == id1, my_name], tdt[my_id == id2, my_name]) :
object 'my_id' not found
Again, if I try to do the same thing without an mapply, it works.
get_pairs(tid1[1],tid2[1],DT)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
Why does this function fail only when used within an mapply? I think this has something to do with the scope of data.table names, but I'm not sure.
Alternatively, is there a different/more efficient way to accomplish this task? I have a large data.table with a third id "sample" and I need to get all of these pairs for each sample (e.g. operating on DT[sample=="sample_id",] ). I am new to the data.table package, and I may not be using it in the most efficient way.
The function debugonce() is extremely useful in these scenarios.
debugonce(mapply)
mapply(get_pairs, tid1, tid2, DT)
# Hit enter twice
# from within BROWSER
debugonce(FUN)
# Hit enter twice
# you'll be inside your function, and then type tdt (the name of the function's argument)
tdt
# [1] "A" "B" "C" "D" "E" "F"
Q # (to quit debugging mode)
which is wrong: the function received only the my_name column, not the whole data.table. Basically, mapply() takes the first element of each input argument and passes it to your function, then the second element of each, and so on. In this case you've provided a data.table, which is also a list. So, instead of passing the entire data.table, it's passing each element of the list (the columns) in turn.
So, you can get around this by doing:
mapply(get_pairs, tid1, tid2, list(DT))
But mapply() simplifies the result by default, and therefore you'd get a matrix back. You'll have to use SIMPLIFY = FALSE.
mapply(get_pairs, tid1, tid2, list(DT), SIMPLIFY = FALSE)
Or simply use Map:
Map(get_pairs, tid1, tid2, list(DT))
Use rbindlist() to bind the results.
HTH
Enumerate all possible pairs
u_name <- unique(DT$my_name)
all_pairs <- CJ(u_name,u_name)[V1 < V2]
Enumerate observed pairs
obs_pairs <- unique(
DT[,{un <- unique(my_name); CJ(un,un)[V1 < V2]}, by=my_id][, !"my_id"]
)
Take the difference
all_pairs[!J(obs_pairs)]
CJ is like expand.grid except that it creates a data.table with all of its columns as its key. A data.table X must be keyed for a join X[J(Y)] or a not-join X[!J(Y)] (like the last line) to work. The J is optional, but makes it more obvious that we're doing a join.
Simplifications. @CathG pointed out that there is a cleaner way of constructing obs_pairs if you always have two sorted "names" for each "id" (as in the example data): use as.list(un) in place of CJ(un,un)[V1 < V2].
Why does this function fail only when used within an mapply? I think
this has something to do with the scope of data.table names, but I'm
not sure.
The reason the function is failing has nothing to do with scoping in this case. mapply vectorizes the function: it takes each element of each argument in turn and passes it to the function. In your case, the elements of the data.table are its columns, so mapply is passing the column my_name instead of the complete data.table.
If you want to pass the complete data.table to mapply, you should use the MoreArgs parameter. Then your function will work:
res <- mapply(get_pairs, tid1, tid2, MoreArgs = list(tdt=DT), SIMPLIFY = FALSE)
do.call("rbind", res)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
5 A E
6 B E
7 A F
8 B F
9 C E
10 D E
11 C F
12 D F
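The difference between iterating over an argument and passing it whole can be seen with a plain data frame as well, since a data frame is also a list of columns. The sketch below is an illustration only: `get_pairs_df` is a hypothetical base R variant of get_pairs, written so that it runs without data.table.

```r
df <- data.frame(
  my_name = c("A", "B", "C", "D", "E", "F"),
  my_id   = c(2, 2, 3, 3, 4, 4),
  stringsAsFactors = FALSE
)

# What mapply hands over when a data frame is given as a vectorized argument:
# one column per call, not the whole table
seen <- mapply(function(i, piece) class(piece), 1:2, df)
print(seen)

# Hypothetical data frame variant of get_pairs
get_pairs_df <- function(id1, id2, tdt) {
  expand.grid(
    Var1 = tdt$my_name[tdt$my_id == id1],
    Var2 = tdt$my_name[tdt$my_id == id2],
    stringsAsFactors = FALSE
  )
}

ids <- combn(unique(df$my_id), 2)

# MoreArgs passes df unchanged to every call
res <- mapply(get_pairs_df, ids[1, ], ids[2, ],
              MoreArgs = list(tdt = df), SIMPLIFY = FALSE)
pairs <- do.call(rbind, res)
print(pairs)
```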
I have a dataframe in R with fields x,y,A,B like this:
d <- data.frame(x=seq(1,26),y=seq(1,26,by=2),A=LETTERS[seq(1,26)],
B=letters[seq(1,26)])
and a vector of column names:
colnames <- c('A','B')
I would like to add extra fields A_mod and B_mod to the dataframe, with values computed by a function of the values of A and B, respectively, in the same row.
So I would like to end up with a dataframe with fields x, y, A, B, A_mod and B_mod. The values of A_mod and B_mod are calculated, respectively, by
A_mod = tolower(A)
and
B_mod = toupper(B) for instance (just a mock example).
How can I do that without explicitly naming the columns A, B, A_mod and B_mod, using the vector colnames instead? In reality my dataframe has 30 columns, so writing everything by hand is tedious (and the number of columns will increase).
Try
d[paste0(colnames,'_mod')] <- lapply(d[colnames], tolower)
head(d,3)
# x y A B A_mod B_mod
#1 1 1 A a a a
#2 2 3 B b b b
#3 3 5 C c c c
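The line above applies tolower to every listed column. If each column needs its own function, as in the mock example from the question (tolower for A, toupper for B), one possible sketch is to keep the functions in a named list and pair them with the columns using Map. The names `cols` and `funs` are my own and not from the question.

```r
d <- data.frame(
  x = 1:26,
  y = seq(1, 26, by = 2),   # recycled to 26 rows, as in the question
  A = LETTERS,
  B = letters,
  stringsAsFactors = FALSE
)

cols <- c("A", "B")
# One function per column, looked up by column name
funs <- list(A = tolower, B = toupper)

d[paste0(cols, "_mod")] <- Map(function(f, col) f(col), funs[cols], d[cols])
print(head(d, 3))
```

Adding a column then only requires extending `cols` and `funs`, not the assignment itself.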