How to remove row-wise sorting in R data frames

I have written the following code to bind two columns and create a data frame.
complete<-function(directory,id){
  x<-vector()
  y<-vector()
  files<-list.files(directory,full.names=TRUE)
  for(i in id){
    x[i]<-i
    y[i]<-sum(complete.cases(read.csv(files[i])))
  }
  d<-na.omit(data.frame(x,y))
  colnames(d)<-c("id","nobs")
  rownames(d)<-1:nrow(d)
  print(d)
}
I have the following test case :
complete("specdata",30:25)
id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 71
6 30 932
I am not able to get the output in the order passed to the function, i.e.
id=30 as the first value and id=25 as the last value. How do I disable the automatic sorting by id?

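The rows come out sorted because x[i]<-i stores each value at position i, so the position in the vector, not the order of the call, decides the row order. We can change for(i in id) to for(i in seq_along(id)) to loop over the positions of 'id', and use id[i] when assigning x[i] and y[i].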
complete<-function(directory, id){
  x<- vector()
  y<- vector()
  files<-list.files(directory,full.names=TRUE)
  for(i in seq_along(id)){
    x[i]<- id[i]
    y[i]<-sum(complete.cases(read.csv(files[id[i]])))
  }
  d<-na.omit(data.frame(x,y))
  colnames(d)<-c("id","nobs")
  rownames(d)<-1:nrow(d)
  print(d)
}
Testing
complete('specdata', 25:30)
#id nobs
#1 25 4
#2 26 0
#3 27 1
#4 28 1
#5 29 2
#6 30 13
complete('specdata', 30:25)
# id nobs
#1 30 13
#2 29 2
#3 28 1
#4 27 1
#5 26 0
#6 25 4
NOTE: The values are different because the 'specdata' directory that I have is from an earlier Coursera link. The data may have been updated since.
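To see why indexing by the id value reorders the rows, here is a minimal sketch that needs no files. It only mimics the index bookkeeping of the two loops; the names by_value and by_position are made up for illustration.

```r
id <- 30:25

# Original pattern: the id value itself is used as the position,
# so 30 lands in slot 30, 29 in slot 29, and so on.
by_value <- vector()
for (i in id) by_value[i] <- i
by_value <- by_value[!is.na(by_value)]   # roughly what na.omit() leaves behind

# Fixed pattern: the loop counter is the position, id[i] is the value.
by_position <- vector()
for (i in seq_along(id)) by_position[i] <- id[i]

print(by_value)      # ascending, whatever order 'id' was given in
print(by_position)   # same order as 'id'
```

Slots 1 to 24 of by_value are filled with NA along the way, which is why the original function needed na.omit() at all.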

Related

How to sort a data frame by column?

I want to sort a data frame by the values of a column (the first column, called Initial).
My data frame, which I called t2, is:
Initial Final Changes
1 1 200
1 3 500
3 1 250
24 25 175
21 25 180
1 5 265
3 3 147
I am trying with code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=False),]
But, the result is of the type:
Initial Final Changes
3 1 250
3 3 147
21 25 180
24 25 175
1 5 265
1 1 200
1 3 500
And when I try with code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=TRUE),]
The result is:
Initial Final Changes
1 5 265
1 1 200
1 3 500
24 25 175
21 25 180
3 1 250
3 3 147
I don't understand what is happening.
Can you help me, please?
It is possible that the columns are factors. In that case, converting them to numeric should work
library(dplyr)
t2 %>%
arrange_at(1:2, ~ desc(as.numeric(as.character(.))))
Or with base R
t2[1:2] <- lapply(t2[1:2], function(x) as.numeric(as.character(x)))
t2[do.call(order, c(t2[1:2], decreasing = TRUE)), ]
Or the OP's code should work as well
Note that the first option the OP tried uses decreasing = False (maybe a typo). In R, the logical constants are upper case (TRUE/FALSE); False is looked up as an ordinary object name and is not defined
t2[order(t2$Initial, t2$Final, decreasing=FALSE),]
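A small sketch of the factor pitfall, assuming the columns were read in as factors built from text (the question does not confirm this). Calling as.numeric() directly on a factor returns the internal level codes, so the values have to go through as.character() first. The data below is the t2 from the question, rebuilt by hand.

```r
f <- factor(c("1", "3", "24", "21"))
as.numeric(f)                 # level codes, not the values
as.numeric(as.character(f))   # the actual numbers

t2 <- data.frame(
  Initial = factor(c("1", "1", "3", "24", "21", "1", "3")),
  Final   = factor(c("1", "3", "1", "25", "25", "5", "3")),
  Changes = c(200, 500, 250, 175, 180, 265, 147)
)

# Convert the two key columns, then sort ascending by Initial, then Final
t2[1:2] <- lapply(t2[1:2], function(x) as.numeric(as.character(x)))
sorted <- t2[order(t2$Initial, t2$Final), ]
print(sorted)
```

Checking str(t2) before sorting is a quick way to confirm whether the columns really are factors.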

List/Num: Means results in argument is not numeric or logical: returning NA

I made a list of 100 numbers from 100 permutations using the following code:
for(i in 3:length(GeneData)) {
  # Grab the 37 observations in the gene
  for(j in 1:37) {
    genePerm[j] <- GeneData[j, i]
  }
  # 100 permutations
  for(k in 1:100) {
    genePerm <- sample(genePerm)
    newA <- vector()
    newB <- vector()
    # 23 GeneA, 14 GeneB
    for(l in 1:37) {
      if(l < 24) {
        newA[l] <- genePerm[l]
      } else {
        newB[l - 23] <- genePerm[l]
      }
    }
    testChair[k] <- t.test(newA, newB)$p.value
  }
  permPValues[i - 2] <- mean(testChair)
}
The dataset has 1144 genes, each with 37 observations. The goal was to run a for loop over every gene (3:1147), grab all observations 1:37, and split a random permutation of them into groups A and B of the sizes above. I then take the p-value from the t.test of each split and repeat this permutation 100 times. I wanted to save the mean of those p-values in the last statement before moving on to the next gene.
When I run this, I get over 50 warnings(): "In mean.default(testChair) : argument is not numeric or logical: returning NA"
class(testChair)
[1] "list"
str(testChair)
List of 100 num
A small example of data:
Patient Class 1405_i_at 200030_s_at 200062_s_at
1 A 7.492258127 12.45140014 13.48393678
2 A 7.899176081 12.5622002 13.5087836
3 A 8.668834124 11.84527253 13.04315946
4 A 9.91313519 12.04905336 13.42893925
5 A 6.047972634 12.77472603 13.63448007
6 A 8.368976147 12.24842422 13.41529093
7 A 7.20836421 12.93198614 13.13948227
8 A 7.919268023 12.28964756 13.2397595
9 A 9.003944903 11.68261433 13.08924549
10 A 8.084666646 12.18271975 13.41492561
11 A 9.840143746 12.34283432 12.96360228
12 A 6.99033759 12.57560342 13.2587594
13 A 7.978610388 12.54610449 13.1813085
14 A 8.747926877 11.8624978 13.24270026
15 A 9.486237216 12.01866821 13.33474933
16 A 8.760956899 12.49961585 13.65201122
17 A 8.382102061 12.04592178 13.39857364
18 A 8.073885916 12.3062644 13.76705502
19 A 9.412851349 12.39448144 13.16896019
20 A 8.364735507 12.42232335 13.33859203
21 A 9.157396203 12.09875546 13.27433327
22 A 5.913420688 12.2529661 13.62515813
23 A 7.162806841 11.99874653 13.36516341
24 B 8.734138362 12.17417605 13.43827062
25 B 6.785358473 11.98805352 13.44471807
26 B 7.252916328 12.57831463 13.29748783
27 B 8.988631789 11.9242628 13.88455123
28 B 8.36856432 11.84448206 13.22715915
29 B 6.430597552 12.26866611 13.35831894
30 B 7.092622736 11.86263629 13.22220515
31 B 6.708408743 11.77572547 13.4647575
32 B 6.700810798 11.80986457 13.52356174
33 B 7.032134704 12.25581888 13.34332883
34 B 7.731854575 12.13688324 13.3219734
35 B 7.71243075 11.92484732 13.13499252
36 B 6.293368361 12.07172977 13.39202083
37 B 7.992184287 12.2412432 13.33908972
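The class(testChair) output above points to one likely cause: mean() only accepts numeric or logical vectors, and testChair is a list. The sketch below reproduces the warning on made-up p-values and shows two possible ways around it. It assumes testChair was created as a list before the loop, which the question does not show.

```r
# A list of p-values, similar in shape to the testChair described above
p_list <- list(0.2, 0.4, 0.9)

# mean() on a list returns NA with the "not numeric or logical" warning
bad <- suppressWarnings(mean(p_list))

# Option 1: flatten the list before averaging
good <- mean(unlist(p_list))

# Option 2: preallocate a numeric vector so the results never become a list
p_vec <- numeric(3)
for (k in 1:3) p_vec[k] <- p_list[[k]]
also_good <- mean(p_vec)

print(c(bad, good, also_good))
```

In the original loop, option 2 would correspond to something like testChair <- numeric(100) before the k loop.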

Comparing each element in two columns and set another column

I have a data frame (read from a file with fread) with two columns (dep and label). I want to set another column (mark) to an id value depending on the match. If a 'dep' entry matches a 'label' entry, mark gets the 'id' of the matched 'label'. If there is no match, mark gets the row's own 'id'. Currently I have a workaround using loops, but I know there should be a neater, more idiomatic way to do it in R.
trace <- data.table(id=seq(1:7),dep=c(-1,45,40,47,0,45,43),
label=c(99,40,43,45,47,42,48), mark=rep("",7))
id dep label mark
1: 1 -1 99 1
2: 2 45 40 2
3: 3 40 43 2
4: 4 47 45 4
5: 5 0 47 5
6: 6 45 42 4
7: 7 43 48 3
I know loops are slow in R. As an example, the following naive for/while loop works for small sizes, but my data set is huge.
trace$mark <- trace$id
for (i in 1:length(trace$id)){
  val <- trace$dep[i]
  j <- 1
  while(j<=i && val !=-1 && val!=0){ # don't compare if val is -1/0
    if(val==trace$label[j]){
      trace$mark[i] <- trace$id[j]
    }
    j <-j +1
  }
}
I have also tried using the following approach but it works only if there is a single match.
match <- which(trace$dep %in% trace$label)
match_to <- which(trace$label %in% trace$dep)
trace$mark[match] <- trace$mark[match_to]
This solution might help:
trace[trace[,.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
# id dep label mark
# 1: 1 -1 99 1
# 2: 2 45 40 4
# 3: 3 -1 43 3
# 4: 4 47 45 5
# 5: 5 -1 47 5
# 6: 6 45 42 4
# 7: 7 43 48 3
Update:
To make sure you are not matching dep with 0 or -1 values you can just add another line.
trace[dep %in% c(0,-1), mark:= as.character(id)]
OR
Try this:
trace[trace[!dep %in% c(0,-1),.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
The solution that worked
trace[trace[,.(id,dep=label)],on=.(id<=id,dep),mark:=as.character(i.id),allow.cartesian=TRUE]
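When checking a join-based solution on a small sample, a plain base R restatement of the question's loop can serve as a reference. The sketch below is not faster than the loop; it only spells out the rule (take the id of the last row at or before row i whose label equals dep, skipping dep values of -1 and 0). The helper name mark_for is made up for illustration.

```r
id    <- 1:7
dep   <- c(-1, 45, 40, 47, 0, 45, 43)
label <- c(99, 40, 43, 45, 47, 42, 48)

# Reference rule for one row, written to mirror the for/while loop
mark_for <- function(i) {
  if (dep[i] %in% c(-1, 0)) return(id[i])        # never match -1 or 0
  hits <- which(label[seq_len(i)] == dep[i])     # only rows 1..i are eligible
  if (length(hits) > 0) id[max(hits)] else id[i] # last match wins, else own id
}

mark <- sapply(seq_along(id), mark_for)
print(data.frame(id, dep, label, mark))
```

On the question's seven rows this gives the same mark column as the desired table.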

how to do assignment of numbers in r (one machine n jobs)

I am working on an assignment problem in R. I have the following data frame:
cycle_time TAT ready_for_next ITV_no
2 10 12 0
4 12 16 0
6 13 19 0
8 11 19 0
10 15 25 0
12 17 29 0
14 13 27 0
16 13 29 0
18 12 30 0
20 16 36 0
22 13 35 0
24 12 36 0
26 15 41 0
28 14 42 0
30 17 47 0
My desired dataframe would be
cycle_time TAT ready_for_next ITV_no wait_time
2 10 12 1 0
4 12 16 2 0
6 13 19 3 0
8 11 19 4 0
10 15 25 5 0
12 17 29 1 0
14 13 27 6 0
16 13 29 2 0
18 12 30 3 1
20 16 36 4 1
22 13 35 5 3
24 12 36 6 3
26 15 41 2 3
28 14 42 3 2
30 17 47 5 5
cycle_time = crane cycle time
TAT(in mins) = turn around time of truck
ready_for_next(in mins) = ready to take next container
ITV_no = ITV no to be assigned for that job
***There are only 6 unique trucks available***
Idea here is to assign trucks such that waiting time is minimum.
In first five observations all 5 trucks are assigned.
For the next container i.e row number 6 (on 12th min) ITV_no 1 is coming back from its job so that will get assigned to this job.
7th observation(i.e 14th min) there are no trucks available,so we will have to assign new truck (i.e ITV_no 6)
8th observation(16 min) ITV_no 2 is coming back from its job,so that will get assigned to this job and so on.
If there are no trucks available then it has to wait till the nearest truck comes back from job.
How can I implement this in R?
I have built some logic
cycle_time <- c(2,4,6,8,10,12,14,16,18,20,22,24,26,28,30)
ITV_no <- c(1,2,3,4,5,6,7)
temp <- c()
TAT <- c(10,12,13,11,15,17,13,13,12,16,13,12,15,14,17)
ready_for_next <- cycle_time + TAT
assignment <- data.frame(cycle_time,TAT,ready_for_next)
assignment$ITV_no <- 0
for(i in 1:nrow(assignment)) {
for(j in 1:length(ITV_no)){
assignment$ITV_no[i] <- ifelse(assignment$cycle_time <= assignment$ready_for_next,ITV_no[j],
ifelse())
## I am not able to update the count of trucks which are already assigned
# and which are free to be assigned
}
}
Logic
1. For the first row, increment ITV_no by 1, i.e. directly assign a truck to that job.
2. Check whether cycle_time <= all previous ready_for_next values (i.e. 12). If yes, increment ITV_no by 1; if no, assign the previous ITV_no to that job (i.e. 1).
e.g.
For row 6, the cycle time is compared with all previous ready_for_next values (25,19,19,16,12). It finds a match at the first row, so that ITV_no (i.e. 1) is assigned to row 6.
For row 7, the cycle time is compared with the previous ready_for_next values (25,19,19,16). **12 should be removed from the comparison because that truck is already assigned to a job.** There is no match, so a new truck is assigned to that job.
I have come up with some solution...
It is working with sample data
rm(list=ls())
df <- data.frame(qc_time = seq(2,40,2),itv_tat=c(10,15,12,18,25,19,18,16,14,10,12,15,17,19,13,12,8,15,9,14))
itv_number_vec <- vector()
itv_number_vec <- 0
itvno_time <- list()
for (i in 1:nrow(df))
{
  #### Initialisation ####
  if (i==1)
  {
    df$itv_available_time[i] <- sum(df$qc_time[i] + df$itv_tat[i])
    itvno_time[[i]] <- df$itv_available_time[i]
    df$delay[i] <- 0
    df$itv_number[i] <- 1
    itv_number_vec <- 1
  }
  if(i!=1)
  {
    if (df$qc_time[i] >= min(unlist(itvno_time)))
    {
      for (j in 1:length(itvno_time))
      {
        if (itvno_time[[j]] <= df$qc_time[i])
        {
          df$itv_number[i] <- j
          df$itv_available_time[i] <- sum(df$qc_time[i] + df$itv_tat[i])
          itvno_time[[j]] <- df$itv_available_time[i]
          break
        }
      }
    }else{
      if (max(itv_number_vec)<7)
      {
        df$itv_number[i] <- max(itv_number_vec) + 1
        itv_number_vec <- c(itv_number_vec,(max(itv_number_vec) + 1))
        df$delay[i] <- 0
        df$itv_available_time[i] <- sum(df$qc_time[i] + df$itv_tat[i])
        itvno_time[[max(itv_number_vec)]] <- df$itv_available_time[i]
      }else{
        df$delay[i] <- (min(unlist(itvno_time)) - df$qc_time[i])
        df$itv_number[i] <- which.min(itvno_time)
        df$itv_available_time[i] <- sum(df$qc_time[i], df$itv_tat[i] ,df$delay[i])
        itvno_time[[which.min(itvno_time)]] <- df$itv_available_time[i]
      }
    }
  }
}
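The same greedy idea can be written more compactly by keeping one numeric vector of "free at" times, one slot per deployed truck. This is only a sketch under assumptions: the function name assign_trucks and its arguments are made up, a returning truck is preferred over deploying a new one, and a waiting job pushes back the time its truck becomes free. The question does not fully specify how waiting should be booked, so the wait column may differ from the desired table.

```r
assign_trucks <- function(start, tat, n_trucks) {
  free_at <- numeric(0)              # time at which each deployed truck is free again
  truck   <- integer(length(start))
  wait    <- numeric(length(start))
  for (i in seq_along(start)) {
    idle <- which(free_at <= start[i])
    if (length(idle) > 0) {
      k <- idle[which.min(free_at[idle])]   # reuse the truck that came back first
    } else if (length(free_at) < n_trucks) {
      k <- length(free_at) + 1L             # nobody is back yet: deploy a new truck
    } else {
      k <- which.min(free_at)               # all trucks busy: wait for the earliest
      wait[i] <- free_at[k] - start[i]
    }
    truck[i]   <- k
    free_at[k] <- start[i] + wait[i] + tat[i]
  }
  data.frame(start, tat, truck, wait)
}

# First eight jobs from the question, with six trucks available
res <- assign_trucks(seq(2, 16, 2), c(10, 12, 13, 11, 15, 17, 13, 13), 6)
print(res)
```

For these eight rows the truck column comes out as 1, 2, 3, 4, 5, 1, 6, 2 with no waiting, which agrees with the ITV_no column of the desired table.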

How to use apply function instead of for loop if you have multiple if conditions to be executed

1st DF:
t.d
V1 V2 V3 V4
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
names(t.d) <- c("ID","A","B","C")
t.d$FinalTime <- c("7/30/2009 08:18:35","9/30/2009 19:18:35","11/30/2009 21:18:35","13/30/2009 20:18:35","15/30/2009 04:18:35")
t.d$InitTime <- c("6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35","6/30/2009 9:18:35")
>t.d
ID A B C FinalTime InitTime
1 1 6 11 16 7/30/2009 08:18:35 6/30/2009 9:18:35
2 2 7 12 17 9/30/2009 19:18:35 6/30/2009 9:18:35
3 3 8 13 18 11/30/2009 21:18:35 6/30/2009 9:18:35
4 4 9 14 19 13/30/2009 20:18:35 6/30/2009 9:18:35
5 5 10 15 20 15/30/2009 04:18:35 6/30/2009 9:18:35
2nd DF:
> s.d
F D E Time
1 10 19 28 6/30/2009 08:18:35
2 11 20 29 8/30/2009 19:18:35
3 12 21 30 9/30/2009 21:18:35
4 13 22 31 01/30/2009 20:18:35
5 14 23 32 10/30/2009 04:18:35
6 15 24 33 11/30/2009 04:18:35
7 16 25 34 12/30/2009 04:18:35
8 17 26 35 13/30/2009 04:18:35
9 18 27 36 15/30/2009 04:18:35
Output to be:
From DF "t.d" I have to calculate the time interval for each row between "FinalTime" and "InitTime" (InitTime will always be less than FinalTime).
Another DF "temp" from "s.d" has to be formed having data only within the above time interval, and then the most recent values of "F","D","E" have to be taken and attached to the 'ith' row of "t.d" from which the time interval was calculated.
Also we have to see if the newly formed DF "temp" has the following conditions true:
here 'j' represents value for each row:
if(((temp$F[j] < 35.5) + (temp$D[j] >= 100)) >= 1)
{
temp$Flag[j] <- 1
} else{
temp$Flag[j] <- 0
}
Originally I have 3 million rows in the dataframe and 20 columns in each DF.
I have solved the above problem using "for loop" but it obviously takes 2 to 3 days as there are a lot of rows.
(Also if I have to add new columns to the resultant DF if multiple conditions get satisfied on each row?)
Can anybody suggest a different technique? Like using apply functions?
My suggestion is:
use lapply over row indices
handle in the function call your if branches
return either your dataframe or NULL
combine everything with rbind
by replacing lapply with mclapply from the 'parallel' package, your code gets executed in parallel.
resultList <- lapply(seq_len(nrow(t.d)), function(i){
  # do stuff: build 'df' for row i and evaluate 'condition' here
  if(condition){
    return(df)
  }else{
    return(NULL)
  }
})
resultDF <- do.call(rbind, resultList)
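The lapply/rbind pattern can be tried on toy data first. The sketch below uses a made-up data frame (toy) and a made-up condition (keep rows where A is even); it only illustrates the shape of the approach, not the time-window logic of the question.

```r
toy <- data.frame(ID = 1:4, A = c(6, 7, 8, 9))

resultList <- lapply(seq_len(nrow(toy)), function(i) {
  row <- toy[i, ]
  if (row$A %% 2 == 0) {   # stand-in for the real if conditions
    row$Flag <- 1          # extra columns can be added per row
    return(row)
  } else {
    return(NULL)           # NULL entries are dropped by rbind
  }
})

resultDF <- do.call(rbind, resultList)
print(resultDF)
```

Because each call of the function depends only on i, swapping lapply for parallel::mclapply (with its mc.cores argument) is a drop-in change on Unix-like systems. With millions of rows, note that a single do.call(rbind, ...) over a very long list can itself be slow.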
