Related
I have a two data table (csv) which contain information about a MOOC course.
The first table contains information about mouse movments (distance). Like this:
1-2163.058../2-20903.66351.../3-25428.5415..
The first number means the day (1- first day, 2- second day, etc.) when it happens, the second number means the distance in pixel. (2163.058, 20903.66351, etc.).
The second table contains the same information but instead of distance, there is the time was recorded. Like this:
1-4662.0/2-43738.0/3-248349.0....
The first number means the day (1- first day, 2- second day, etc.) when it happens, the second number means the time in milliseconds.
In the table, every column records a data from the specific web page, and every row records a user behaviour on this page.
I want to create a new table with the same formation, where I want to count the speed by pixel. Divide the distance table with time table which gives new table with same order, shape.
Here are two links for the two tables goo.gl/AVQW7D goo.gl/zqzgaQ
How can I do this with raw csv?
> dput(distancestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("1-0", "1-42522.28760403924"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id", "http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", 0x0000000002640788))
> dput(timestream[1:3,1:3])
structure(list(id = c(2L, 9L, 10L),
`http//tanul.sed.hu/mod/szte/frontpage.php` = structure(c(2L, 1L, 1L),
.Label = c("0", "1-189044.0"),
class = "factor"),
`http//tanul.sed.hu/mod/szte/register.php` = c(0L, 0L, 0L)),
.Names = c("id",
"http//tanul.sed.hu/mod/szte/frontpage.php",
"http//tanul.sed.hu/mod/szte/register.php"),
class = c("data.table", 0x0000000002640788))
This may not be the most efficient method, but I believe it should yield the result you are looking for.
# Set file paths
dist.file <- # C:/Path/To/Distance/File.csv
time.file <- # C:/Path/To/Time/File.csv
# Read data files
dist <- read.csv(dist.file, stringsAsFactors = FALSE)
time <- read.csv(time.file, stringsAsFactors = FALSE)
# Create dataframe for speed values
speed <- dist
speed[,2:ncol(speed)] <- NA
# Create progress bar
pb <- txtProgressBar(min = 0, max = ncol(dist) * nrow(dist), initial = 0, style = 3, width = 20)
item <- 0
# Loop through all columns and rows of distance data
for(col in 2:ncol(dist)){
for(r in 1:nrow(dist)){
# Check that current item has data to be calculated
if(dist[r,col] != 0 & dist[r,col] != "1-0" & !is.na(time[r,col])){
# Split the data into it's separate day values
dists <- lapply(strsplit(strsplit(dist[r,col], "/")[[1]], "-"), as.numeric)
times <- lapply(strsplit(strsplit(time[r,col], "/")[[1]], "-"), as.numeric)
# Calculate the speeds for each day
speeds <- sapply(dists, "[[", 2) / sapply(times, "[[", 2)
# Paste together the day values and assign to the current item in speed dataframe
speed[r,col] <- paste(sapply(dists, "[[", 1), format(speeds, digits = 20), sep = "-", collapse = "/")
} else{
# No data to calculate, assign 0 to current item in speed dataframe
speed[r,col] <- 0
}
# Increase progress bar counter
item <- item + 1
setTxtProgressBar(pb,item)
}
}
# Create a csv for speed data
write.csv(speed, "speed.csv")
I Have the following dataset which includes 2 variables:
dt4<-structure(list(a1 = c(4L, 4L, 3L, 4L, 4L), a2 = c(1L,
3L, 4L, 5L, 4L)), .Names = c("a1", "a2"
), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
I Have the following function that add labels and levels to an existing dataset:
Add_Labels_Level_To_Dataset <- function(df, df_name,levels_list,labels_list) {
df[] <- lapply( df, ordered)
for (i in 1:length(colnames(df))) {
arg0<-paste0(df_name,"[i]", "<-ordered(", df_name, "$'", colnames(df)[i], "', levels=c(", levels_list[[i]], "), labels = c(", labels_list[[i]],"))" )
eval(parse(text=arg0))
}
df
}
which is run by that R command:
Add_Labels_Level_To_Dataset(dt4, "dt4", level_list, labels_list)
The lists supplied in the R command are the following ones which represents the ordered levels of each variable in the dataset, respectively:
label_list=list("'S','SA','SB','SC,'SD'", "'S','SA','SB','SC,'SD'")
level_list=list("5,4,3,2,1", "5,4,3,2,1")
Why my function is not working properly?
I dont know what is wrong with that!
When I run the R commands outside R function, they tie levels/ labels to the dataset given. However, when I run my R function, this does not happen!
df_name="dt4"
df=dt4
levels_list=level_list
labels_list=label_list
i=3
df[] <- lapply( df, ordered)
arg0<-paste0(df_name,"[i]", "<-ordered(", df_name, "$'", colnames(df)[i], "', levels=c(", levels_list[[i]], "), labels = c(", labels_list[[i]],"))" )
eval(parse(text=arg0))
Can you help?
This is a xy problem. I agree with #MrFlick that parse should be avoided.
On the original post the main issue is the function should be returning dt4 and not df. There are some missing ' (single quote) when defining label_list.
We could use mapply and avoid the single quote:
label_list=list(c('S','SA','SB','SC','SD'), c('S','SA','SB','SC','SD'))
level_list=list(c(5,4,3,2,1), c(5,4,3,2,1))
as.data.frame(mapply(function(x, labels,levels ) {ordered(x, labels,levels)}, dt4, level_list, label_list, SIMPLIFY = F))
# a1 a2
#1 SA SD
#2 SA SB
#3 SB SA
#4 SA S
#5 SA SA
Using eval/parse should be avoided. There are tpyically much easier ways to do what you want in R. For example, with this code, we can just write
Add_Labels_Level_To_Dataset <- function(df, levels_list, labels_list) {
df[] <- Map(function(data, levels, labels) {
ordered(data, levels=strsplit(levels,",")[[1]], labels=strsplit(labels, ",")[[1]])
}, df, levels_list, labels_list)
df
}
And we can call it like
dt4 <- Add_Labels_Level_To_Dataset(dt4, level_list, label_list)
Note that it returns a new data.frame which you can reassign to dt4 or some other variable. Functions in R should never modify objects outside their own scope which is one of the other reasons you were running into problems with your function.
it's a fairly simple task, but I'm trying to wrap my head around how to match values using a dataframe with keys and values. I've tried merge, but as the number of rows is different, I'm not sure that's appropriate.
Is there a for loop I can write that will loop through each key in my input dataframe and change Product's value if it's one of the ones in the lookup table?
Essentially, my data looks like this:
input_key <- c(9061,8680,1546,5376,9550,9909,3853,3732,9209)
input_product <- c("Water", "Bread", NA, "Chips", "Chicken", NA, "Chocolate", "Donuts", "Juice")
input <- as.data.frame(cbind(input_key, input_product))
I'd like to replace the NAs with the Product values in the corresponding lookup table:
lookup_key <- c(1245,1546, 7764, 9909)
lookup_product <- c("Ice Cream","Soda", "Bacon","Cheese")
lookup_data <- as.dataframe(cbind(lookup_key, lookup_product))
Finally, I'm hoping to get the final dataframe looking like this:
output_key <- c(9061,8680,1546,5376,9550,9909,3853,3732,9209)
output_product <- c("Water", "Bread", "Soda", "Chips", "Chicken", "Cheese", Chocolate","Donuts", "Juice")
output_data <- as.data.frame(cbind(output_key, output_product))
OPTION 1: Using R-base functions:
Vectorial solution:
input$input_product[input$input_key %in% lookup_data$lookup_key == TRUE] <-
lookup_product[lookup_data$lookup_key %in% input$input_key == TRUE]
Note: The ==TRUE is redundant, added just for better understanding.
Using lapply function:
idx <- input$input_key %in% lookup_data$lookup_key
lapply((1:nrow(input)),
function(i) {
if (idx[i] == TRUE) {
jdx <- lookup_data$lookup_key %in% input$input_key[i]
input$input_product[i] <<- lookup_data$lookup_product[jdx == TRUE]
}
}
)
Note: Attention to the global assignment operation (<<)
Using for loop:
idx <- input$input_key %in% lookup_data$lookup_key
for (i in (1:nrow(input))) {
if (idx[i] == TRUE) {
jdx <- lookup_data$lookup_key %in% input$input_key[i]
input$input_product[i] <- lookup_data$lookup_product[jdx == TRUE]
}
}
Note: Here we just need a simple assignment.
In the above cases you need to create the data frames setting the input argument: stringsAsFactors as FALSE, for example:
input <- as.data.frame(cbind(input_key, input_product), stringsAsFactors = FALSE)
lookup_data <- as.data.frame(cbind(lookup_key, lookup_product), stringsAsFactors = FALSE)
Then you get the output:
> input
input_key input_product
1 9061 Water
2 8680 Bread
3 1546 Soda
4 5376 Chips
5 9550 Chicken
6 9909 Cheese
7 3853 Chocolate
8 3732 Donuts
9 9209 Juice
>
OPTION 2: Using data.tablepackage
I found this elegant solution using inner join:
require(data.table)
setkey(input,input_key)
setkey(lookup_data,lookup_key)
> setDT(input)[setDT(lookup_data), input_product := i.lookup_product, nomatch=0][]
input_key input_product
1: 1546 Soda
2: 3732 Donuts
3: 3853 Chocolate
4: 5376 Chips
5: 8680 Bread
6: 9061 Water
7: 9209 Juice
8: 9550 Chicken
9: 9909 Cheese
>
data.tableis actually very powerful for data set manipulation. Let's explain the syntax behind:
setDT: Converts a data frame by reference (no copy occurs) into data.table, because the original data sets are not a data.table classes, that's the way to
convert them on the fly. Notice that now it is not necessary to use the attribute stringsAsFactors because for data.tableits default value is FALSE.
input[lookup_data, nomatch=0]: Is the way, with data.table package to create a inner join (see this link). It means the interception of both tables. The no match option with value 0 means no rows will be returned for that row of i (in our case: lookup_data).
This would be the output:
> setDT(input)[setDT(lookup_data), nomatch=0][]
input_key input_product lookup_product
1: 1546 NA Soda
2: 9909 NA Cheese
>
input_product := i.lookup_product: assigns the column from the outer
data set, with the value of the inner data set.
[]: Prints the result (for verifying the solution purpose)
For more information about data.tableI would recommend to read the package documentation it comes with many examples. It is also a good idea to run in R the following command (after loading the data.tablepackage):
example(data.table)
It provides more than 50 examples (the same from the package documentation) with its corresponding result about the different uses of this package.
PERFORMANCE
Let's compare all possible alternatives in terms of performance. Then we need to modify
the input data set for increasing its size:
rep.num <- 1000
input_key <- rep(c(9061,8680,1546,5376,9550,9909,3853,3732,9209),rep.num)
input_product <- rep(c("Water", "Bread", NA, "Chips", "Chicken", NA, "Chocolate",
"Donuts", "Juice"),rep.num)
input <- as.data.frame(cbind(input_key, input_product), stringsAsFactors=F)
Wrap all different alternatives into a corresponding given function. I have included
the solution via dplyr proposed by #count
vectSol <- function(input, lookup_data) {
input$input_product[input$input_key %in% lookup_data$lookup_key == TRUE] <-
lookup_product[lookup_data$lookup_key %in% input$input_key == TRUE]
return(input)
}
lapplySol <- function(input, lookup_data) {
idx <- input$input_key %in% lookup_data$lookup_key
lapply((1:nrow(input)),
function(i) {
if (idx[i] == TRUE) {
jdx <- lookup_data$lookup_key %in% input$input_key[i]
input$input_product[i] <<- lookup_data$lookup_product[jdx == TRUE]
}
}
)
return(input)
}
forSol <- function(input, lookup_data) {
idx <- input$input_key %in% lookup_data$lookup_key
for (i in (1:nrow(input))) {
if (idx[i] == TRUE) {
jdx <- lookup_data$lookup_key %in% input$input_key[i]
input$input_product[i] <- lookup_data$lookup_product[jdx == TRUE]
}
}
return(input)
}
dataTableSol <- function (input, lookup_data) {
setkey(input,input_key)
setkey(lookup_data,lookup_key)
input[lookup_data, input_product := i.lookup_product, nomatch=0]
return(input)
}
dplyrSol <- function(input, lookup_data) {
rbind(input[!is.na(input$input_product),],
inner_join(lookup_data,input,by=c("lookup_key"="input_key")) %>%
select(lookup_key,lookup_product) %>%
rename(input_product = lookup_product, input_key = lookup_key))
return(input)
}
Now test each solution (double check).
Make a copy of the input data set, because data.table operate by reference. We need to create a copy from scratch.
input.copy <- setDT(as.data.frame(cbind(input_key, input_product), stringsAsFactors=F))
lookup_data.copy<- setDT(as.data.frame(cbind(lookup_key, lookup_product),
stringsAsFactors=F))
input1.out <- vectSol(input, lookup_data)
input2.out <- lapplySol(input, lookup_data)
input3.out <- forSol(input, lookup_data)
input4.out <- forSol(input, lookup_data)
input5.out <- dataTableSol(copy(input.copy), lookup_data.copy)
We use the package compare because all.equal fails for comparing a data frame
with a data.table object, because the attributes values, therefore we need a
comparison that only checks the values.
library(compare)
OK <- all(
all.equal(input1.out, input2.out) && all.equal(input1.out, input3.out)
&& all.equal(input1.out, input4.out)
&& compare(input1.out[order(input1.out$input_key),],
input5.out, ignoreAttrs=T)$result
)
try(if(!OK) stop("Result are not the same for all methods"))
Now let's to use microbenchmarkpackage for comparing the time performance of all solutions
library(microbenchmark)
op <- microbenchmark(
VECT = {vectSol(input, lookup_data)},
FOR = {forSol(input, lookup_data)},
LAPPLY = {lapplySol(input, lookup_data)},
DPLYR = {dplyrSol(input, lookup_data)},
DATATABLE = {dataTableSol(input.copy, lookup_data.copy)},
times=100L)
print(op)
Here is the result:
Unit: milliseconds
expr min lq mean median uq max neval cld
VECT 1.005890 1.078983 1.384964 1.108162 1.282269 6.562040 100 a
FOR 416.268583 438.545475 476.551526 449.679426 476.032938 740.027018 100 b
LAPPLY 428.456092 454.664204 492.918478 464.204607 501.168572 751.786224 100 b
DPLYR 13.371847 14.919726 16.482236 16.105815 17.086174 23.537866 100 a
DATATABLE 1.699995 2.059205 2.427629 2.279371 2.489406 8.542219 100 a
Additionally we can graph the solution, via:
library(ggplot2) #nice log plot of the output
qplot(y=time, data=op, colour=expr) + scale_y_log10()
The best performance on this order is: Vectorial, data.table, dplyr, for-loop, lapply.
Pretty tired so this is clumsy, but it should work for the data provided (your output sample is probaly wrong though):
require(dplyr)
rbind(input[!is.na(input$input_product),],
inner_join(lookup_data,input,by=c("lookup_key"="input_key")) %>%
select(lookup_key,lookup_product) %>%
rename(input_product = lookup_product, input_key = lookup_key))
This is easily done using the data.table package as follows:
# load sample data
input_data <- structure(list(
input_key =
structure(c(6L, 5L, 1L, 4L, 8L, 9L,
3L, 2L, 7L),
.Label = c("1546", "3732", "3853", "5376", "8680",
"9061", "9209", "9550", "9909"), class = "factor"),
input_product = structure(c(7L, 1L, NA, 3L, 2L, NA, 4L, 5L, 6L),
.Label = c("Bread", "Chicken", "Chips", "Chocolate",
"Donuts", "Juice", "Water"), class = "factor")),
.Names = c("input_key",
"input_product"),
row.names = c(NA, -9L), class = "data.frame")
lookup_data <- structure(list(
lookup_key = structure(1:4,
.Label = c("1245", "1546", "7764", "9909"), class = "factor"),
lookup_product = structure(c(3L,
4L, 1L, 2L), .Label = c("Bacon", "Cheese", "Ice Cream", "Soda"
), class = "factor")), .Names = c("lookup_key", "lookup_product"
), row.names = c(NA, -4L), class = "data.frame")
# convert to data.table and add keys for merging
library(data.table)
input <- data.table(input_data, key = 'input_key')
lookup <- data.table(lookup_data, key = 'lookup_key')
# merge the data (can use merge method as well)
DT <- lookup[input]
# where the input_product is NA, replace with lookup
DT[is.na(input_product), input_product := lookup_product]
print(DT)
# you can now get rid of lookup_product column, if you like
DT[, lookup_product:= NULL]
print(DT)
The final output of the above is:
> print(DT)
lookup_key input_product
1: 1546 Soda
2: 3732 Donuts
3: 3853 Chocolate
4: 5376 Chips
5: 8680 Bread
6: 9061 Water
7: 9209 Juice
8: 9550 Chicken
9: 9909 Cheese
How can i compare two datasets and find the common gene names, provided if CNA and chr of both datasets are same
dt1
CNA chr Genes
gain 5 Sall3,Kcng2,Atp9b,Nfatc1,Ctdp1
loss 5 RNU6-866P,TRIM5,TRIM34,TRIM22,TRIM5
gain 2 PDIA5,SEMA5B
dt2
CNA chr Genes
gain 5 Sall3,Nfatc1,SNORA5,SNORA5
gain 5 RNU6-866P,OR8J1,OR8K3,OR8K3
gain 2 PDIA5,DCC
expected output
df3
CNA chr Genes
gain 5 Sall3,Nfatc1
gain 2 PDIA5
Im sure this is trivial question, but would love to have suggestions to help me a bit.
Here is an approach:
library(data.table)
df2 = setDT(df2)[,list(Genes=paste0(Genes, collapse=',')),by=list(CNA, chr)]
res = setkey(setDT(df1), CNA, chr)[df2]
# CNA chr Genes Genes.1
#1: gain 5 Sall3,Kcng2,Atp9b,Nfatc1,Ctdp1 Sall3,Nfatc1,SNORA5,SNORA5,RNU6-866P,OR8J1,OR8K3,OR8K3
#2: gain 2 PDIA5,SEMA5B PDIA5,DCC
res[, paste0(intersect(strsplit(Genes,',')[[1]], strsplit(Genes.1,',')[[1]]), collapse=',')
, by=list(CNA, chr)]
# CNA chr V1
#1: gain 5 Sall3,Nfatc1
#2: gain 2 PDIA5
Data:
df1 = structure(list(CNA = c("gain", "gain", "loss"), chr = c(2L, 5L,
5L), Genes = c("PDIA5,SEMA5B", "Sall3,Kcng2,Atp9b,Nfatc1,Ctdp1",
"RNU6-866P,TRIM5,TRIM34,TRIM22,TRIM5")), .Names = c("CNA", "chr",
"Genes"), class = "data.frame", row.names = c(NA, -3L))
df2 = structure(list(CNA = c("gain", "gain", "gain"), chr = c(5L, 5L,
2L), Genes = c("Sall3,Nfatc1,SNORA5,SNORA5", "RNU6-866P,OR8J1,OR8K3,OR8K3",
"PDIA5,DCC")), .Names = c("CNA", "chr", "Genes"), class = "data.frame", row.names = c(NA,
-3L))
Not very elegant but
dt1 <- read.table(header = TRUE, text = "CNA chr Genes
gain 5 Sall3,Kcng2,Atp9b,Nfatc1,Ctdp1
loss 5 RNU6-866P,TRIM5,TRIM34,TRIM22,TRIM5
gain 2 PDIA5,SEMA5B", stringsAsFactors = FALSE)
dt2 <- read.table(header = TRUE, text= "CNA chr Genes
gain 5 Sall3,Nfatc1,SNORA5,SNORA5
gain 5 RNU6-866P,OR8J1,OR8K3,OR8K3
gain 2 PDIA5,DCC", stringsAsFactors = FALSE)
f <- function(x, y, z = 'Genes') {
## split the genes out and find common ones
xx <- strsplit(x[, z], ',')
yy <- strsplit(y[, z], ',')
res <- lapply(seq_along(xx), function(ii)
intersect(xx[[ii]], yy[[ii]]))
## combine back into one of the data frames
res <- lapply(res, paste, collapse = ',')
res <- cbind(x[, 1:2], Genes = do.call('rbind', res))
## make sure the chr and alterations are the same and only return those
idx <- sapply(1:nrow(x), function(ii) all(x[ii, 1:2] == y[ii, 1:2]))
res[idx, ]
}
f(dt1, dt2)
# CNA chr Genes
# 1 gain 5 Sall3,Nfatc1
# 3 gain 2 PDIA5
I'm having an issue using apply functions (which I assume is the right way to do the following) across multiple data frames.
Some example data (3 different data frames, but the problem I'm working on has upwards of 50):
biz <- data.frame(
country = c("england","canada","australia","usa"),
businesses = sample(1000:2500,4))
pop <- data.frame(
country = c("england","canada","australia","usa"),
population = sample(10000:20000,4))
restaurants <- data.frame(
country = c("england","canada","australia","usa"),
restaurants = sample(500:1000,4))
Here's what I ultimately want to do:
1) Sort eat data frame from largest to smallest, according to the variable that's included
dataframe <- dataframe[order(dataframe$VARIABLE,)]
2) then create a vector variable that gives me the rank for each
dataframe$rank <- 1:nrow(dataframe)
3) Then create another data frame that has one column of the countries and the rank for each of the variables of interest as other columns. Something that would look like (rankings aren't real here):
country.rankings <- structure(list(country = structure(c(5L, 1L, 6L, 2L, 3L, 4L), .Label = c("brazil",
"canada", "england", "france", "ghana", "usa"), class = "factor"),
restaurants = 1:6, businesses = c(4L, 5L, 6L, 3L, 2L, 1L),
population = c(4L, 6L, 3L, 2L, 5L, 1L)), .Names = c("country",
"restaurants", "businesses", "population"), class = "data.frame", row.names = c(NA,
-6L))
So I'm guessing there's a way to put each of these data frames together into a list, something like:
lib <- c(biz, pop, restaurants)
And then do an lapply across that to 1) sort, 2)create the rank variable and 3) create the matrix or data frame of rankings for each variable (# of businesses, population size, # of restaurants) for each country. Problem I'm running into is that writing the lapply function to sort each data frame runs into issues when I try to order by the variable:
sort <- lapply(lib,
function(x){
x <- x[order(x[,2]),]
})
returns the error message:
Error in `[.default`(x, , 2) : incorrect number of dimensions
because I'm trying to apply column headings to a list. But how else would I tackle this problem when the variable names are different for every data frame (but keeping in mind that the country names are consistent)
(would also love to know how to use this using plyr)
Ideally I'd would recommend data.table for this.
However, here is a quick solution using data.frame
Try this:
Step1: Create a list of all data.frames
varList <- list(biz,pop,restaurants)
Step2: Combine all of them in one data.frame
temp <- varList[[1]]
for(i in 2:length(varList)) temp <- merge(temp,varList[[i]],by = "country")
Step3: Get ranks:
cbind(temp,apply(temp[,-1],2,rank))
You can remove the undesired columns if you want!!
cbind(temp[,1:2],apply(temp[,-1],2,rank))[,-2]
Hope this helps!!
totaldatasets <- c('biz','pop','restaurants')
totaldatasetslist <- vector(mode = "list",length = length(totaldatasets))
for ( i in seq(length(totaldatasets)))
{
totaldatasetslist[[i]] <- get(totaldatasets[i])
}
totaldatasetslist2 <- lapply(
totaldatasetslist,
function(x)
{
temp <- data.frame(
country = totaldatasetslist[[i]][,1],
countryrank = rank(totaldatasetslist[[i]][,2])
)
colnames(temp) <- c('country', colnames(x)[2])
return(temp)
}
)
Reduce(
merge,
totaldatasetslist2
)
Output -
country businesses population restaurants
1 australia 3 3 3
2 canada 2 2 2
3 england 1 1 1
4 usa 4 4 4