First, I have two features that are initially character vectors.
train_address = train$address
test_address = test$address
Then I bind them together.
address = c(train_address, test_address)
Then I convert them from character to integer, because I will dummy-encode them later and want the processing to be faster. (The characters are not in English.)
train_address = as.integer(factor(train_address, levels = unique(address)))
test_address = as.integer(factor(test_address, levels = unique(address)))
Now, here is the problem. The code is shown below.
My goal is to set every value that appears in train but not in test to 0.
for (a in train_address) {
if (!(train_address[a] %in% test_address)) {
train_address[a] = 0
}
}
train_address = as.factor(train_address)
test_address = as.factor(test_address)
After processing the data this way, I expect:
the number of factor levels in test + 1 = the number of factor levels in train
(because R indexing starts from 1, 0 is not used until the for loop above sets some of the train values to it)
But in reality, the difference between the number of factor levels in train and in test is 400+.
I know there must be something wrong with the code, but I don't know where...
The following should do the trick.
You don't need a loop for this; use vectorized manipulation instead.
train_address[!(train_address %in% test_address)] <- 0
Explanation:
(train_address %in% test_address) gives a logical vector where TRUE means the element of train_address is in test_address
! negates that logical vector
train_address[!(train_address %in% test_address)] gives all the elements of train_address that are not in test_address.
Finally, you set them to zero with the command train_address[!(train_address %in% test_address)] <- 0
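As a minimal sketch of why the original loop misbehaves (the integer codes below are made-up illustration data, not the asker's addresses): for (a in train_address) walks over the values of the vector, so train_address[a] uses each value as a position. Iterating over positions with seq_along() gives the same result as the vectorized line.

```r
# Hypothetical integer codes standing in for the encoded addresses
train_address <- c(1L, 2L, 3L, 4L, 2L)
test_address  <- c(2L, 4L, 5L)

# Vectorized: zero every train value that never occurs in test
vectorized <- train_address
vectorized[!(vectorized %in% test_address)] <- 0L

# Loop version that iterates over positions rather than values
looped <- train_address
for (i in seq_along(looped)) {
  if (!(looped[i] %in% test_address)) {
    looped[i] <- 0L
  }
}

print(vectorized)
print(identical(vectorized, looped))
```

Both versions leave the values shared with test untouched and replace the rest with 0.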
I have the following MATLAB code and I'm working on translating it to R:
nproc=40
T=3
lambda=4
tarr = zeros(1, nproc);
i = 1;
while (min(tarr(i,:))<= T)
tarr = [tarr; tarr(i, :)-log(rand(1, nproc))/lambda];
i = i+1;
end
tarr2=tarr';
X=min(tarr2);
stairs(X, 0:size(tarr, 1)-1);
It is the Poisson process from the renewal-process perspective. I've done my best in R, but something is wrong in my code:
nproc<-40
T<-3
lambda<-4
i<-1
tarr=array(0,nproc)
lst<-vector('list', 1)
while(min(tarr[i]<=T)){
tarr<-tarr[i]-log((runif(nproc))/lambda)
i=i+1
print(tarr)
}
tarr2=tarr^-1
X=min(tarr2)
plot(X, type="s")
The loop prints an aleatory number of arrays and only the last is saved by tarr after it.
The result has to look like...
Thank you in advance. All interesting and supportive comments will be rewarded.
Adding on to the previous comment, there are a few things happening in the MATLAB script that are not in the R version:
[tarr; tarr(i, :)-log(rand(1, nproc))/lambda]; from my understanding, you are adding another row to your matrix and populating it with tarr(i, :)-log(rand(1, nproc))/lambda.
You will need to use a different method, as MATLAB and R handle this type of thing differently.
One glaring thing that stands out to me is that you seem to be using R: tarr[i] and M: tarr(i, :) as equals, when these are very different. What I think you are trying to achieve is all the columns in a given row i, so in R that would look like tarr[i, ].
The use of min is also different: R: min() returns the minimum of the whole matrix (just one number), whereas M: min() returns the minimum value of each column. For this in R you can use the Rfast package: Rfast::colMins.
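To make the indexing and min() differences concrete, here is a small sketch on a made-up 2 x 3 matrix. It uses apply() from base R as a stand-in for Rfast::colMins, so no extra package is needed.

```r
# Made-up example matrix: 2 rows, 3 columns
m <- matrix(c(3, 1, 4,
              1, 5, 9), nrow = 2, byrow = TRUE)

row_two    <- m[2, ]            # all columns of row 2, like MATLAB m(2, :)
overall    <- min(m)            # one number: the minimum of the whole matrix
col_minima <- apply(m, 2, min)  # one minimum per column, like MATLAB min(m)
row_minima <- apply(m, 1, min)  # one minimum per row, like MATLAB min(m')

print(row_two)
print(overall)
print(col_minima)
print(row_minima)
```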
I am not very familiar with the stairs part, but something like ggplot2::qplot(..., geom = "step") may work.
I have tried to create something that works in R, but I am not sure what the required output really is. Nevertheless, hopefully some of the basics can help you get it done on your side. Below is a quick attempt!
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
# Major alteration, create a temporary row from previous row in tarr
temp <- matrix(tarr[i, ] - log((runif(nproc))/lambda), nrow = 1)
# Join temp row to tarr matrix
tarr <- rbind(tarr, temp)
i = i + 1
}
# I am not sure what was meant by tarr' in the matlab script I took it as inverse of tarr
# which in matlab is tarr.^(-1)??
tarr2 = tarr^(-1)
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
As you can see, I have sorted min_for_each_col so that the plot is actually a stair plot and not some random stepwise plot. I think there is a problem: in the MATLAB code, 0:size(tarr2, 1)-1 gives the number of rows less 1, but I can't figure out why, if we grab colMins (and there are 40 columns), we would create around 20 steps. But I might be completely misunderstanding! Also, I have changed T to T0, since in R T exists as TRUE and is not good to overwrite!
Hope this helps!
I downloaded GNU Octave today to actually run the MATLAB code. After watching the code run, I made a few tweaks to the great answer by @Croote:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
temp <- matrix(tarr[i, ] - log(runif(nproc))/lambda, nrow = 1) #fixed paren
tarr <- rbind(tarr, temp)
i = i + 1
}
tarr2 = t(tarr) #takes transpose
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
Edit: Some extra plotting tweaks -- seems to be closer to the original
qplot(seq_along(min_for_each_col), c(1:length(min_for_each_col)), geom="step", ylab="", xlab="")
#or with ggplot2
df1 <- as.data.frame(cbind(min_for_each_col, 1:length(min_for_each_col))) # no %>% needed; ggplot2 and Rfast do not attach it
colnames(df1)[2] <- "index"
ggplot() +
geom_step(data = df1, mapping = aes(x = min_for_each_col, y = index), color = "blue") +
labs(x = "", y = "")
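For comparison, a base-R-only sketch of the same translation, with no Rfast or ggplot2. It reads the MATLAB min(tarr') as one minimum per row of tarr (that is, per time step), which is an interpretation of the original script rather than something confirmed by the asker. The seed is arbitrary and only there for reproducibility.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
nproc  <- 40
T0     <- 3  # named T0 so that R's built-in T (TRUE) is not overwritten
lambda <- 4

# Start with one row of zeros and append a new row per iteration
tarr <- matrix(0, nrow = 1, ncol = nproc)
i <- 1
while (min(tarr[i, ]) <= T0) {
  # -log(U) / lambda is an exponential waiting time with rate lambda
  tarr <- rbind(tarr, tarr[i, ] - log(runif(nproc)) / lambda)
  i <- i + 1
}

# Minimum across the nproc processes at each step (MATLAB: min(tarr'))
X <- apply(tarr, 1, min)
steps <- seq_along(X) - 1  # MATLAB: 0:size(tarr, 1)-1

# Stair plot, analogous to MATLAB stairs(X, steps)
if (interactive()) plot(X, steps, type = "s", xlab = "", ylab = "")
```

X has one entry per row of tarr, so it pairs up with steps without any sorting.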
I'm not too familiar with renewal processes or MATLAB, so bear with me if I misunderstood the intention of your code. That said, let's break down your R code step by step and see what is happening.
The first 4 lines assign numbers to variables.
The fifth line creates an array with 40 (nproc) zeros.
The sixth line (which doesn't seem to be used later) creates a list of length one (a vector of mode 'list').
The seventh line starts a while loop. I suspect this line is supposed to say: while the min value of tarr is less than or equal to T ...
or it's supposed to say: while i is less than or equal to T ...
It actually takes the minimum of a single logical value (tarr[i] <= T). This can work because TRUE and FALSE are treated like numbers. Namely:
TRUE == 1 # returns TRUE
FALSE == 0 # returns TRUE
TRUE == 0 # returns FALSE
FALSE == 1 # returns FALSE
However, since the value of tarr[i] depends on a random number (see line 8), this could lead to the same code running differently each time it is executed. This might explain why the code "prints an aleatory number of arrays".
The eighth line overwrites tarr with the result of the computation on the right. It takes the single value tarr[i] and subtracts from it the natural log of runif(nproc) divided by 4 (lambda), which gives 40 different values. Only the forty values from the last pass through the loop remain stored in tarr.
If you want to store all forty values from each pass through the loop, I'd suggest storing them in, say, a matrix or data frame instead. If that's what you want to do, here's an example of storing them in a matrix:
for(i in 1:nrow(yourMatrix)){
# computations
yourMatrix[i,] <- rowCreatedByComputations
}
See this answer for more info about that. Also, since it's a set number of values per run, you could keep them in a vector and simply append to the vector each loop like this:
vector <- c(vector,newvector)
The ninth line increases i by one.
The tenth line prints tarr.
The eleventh line closes the loop statement.
Then, after the loop, tarr2 is assigned 1/tarr. Again, this will be 40 values from the last pass through the loop (line 8).
Then X is assigned the min value of tarr2.
This single value is plotted in the last line.
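The storage suggestion from the walkthrough above can be sketched as follows. The number of passes (n_steps) and the seed are hypothetical values chosen only for illustration. The matrix is allocated up front and each pass fills one row, so nothing is overwritten.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
nproc   <- 40
lambda  <- 4
n_steps <- 5   # hypothetical fixed number of passes through the loop

# One row per pass, one column per process
draws <- matrix(NA_real_, nrow = n_steps, ncol = nproc)
for (i in seq_len(n_steps)) {
  draws[i, ] <- -log(runif(nproc)) / lambda  # keep all 40 values of pass i
}

print(dim(draws))
```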
Also note that runif samples from the uniform distribution -- if you're looking for a Poisson distribution see: Poisson
Hope this helped! Let me know if there's more I can do to help.
I have the following variables: CFNAIdiff (first differenced), HOUSTgr, INDPROgr, UMCSENTgr, and UNRATEgr (which are growth rates). I want to build an AR model and I am trying to construct a data frame in the following way:
dataframe <- data.frame(y = INDPROgr[2:T], INDPROgr = INDRPOgr[1:(T-1)],
HOUSTgr = HOUSTgr[1:(T-1)], UMCSENTgr = UMCSENTgr[1:(T-1)],
UNRATEgr = UNRATEgr[1:(T-1)], CFNAIdiff = CFNAIdiff[1:(T-1)])
However, I encounter the following problem:
Error in INDPROgr[1:(T - 1)] :
only 0's may be mixed with negative subscripts
What am I specifying wrong?
The error is stating that you are trying to subset with both positive and negative indices at once. Let's make a simple example
dat <- data.frame(A = LETTERS[1:10], B = 1:10)
We can subset the data.frame in this example using standard methods as you are doing in your own code
dat[0:3,]
which will return the first 3 rows. Here the index 0 selects nothing and is silently dropped; used on its own, it returns a data frame with zero rows (different from a row of nulls)
dat[0,]
Now if we by mistake end up subsetting by, let's say, a variable T, and this for some reason is 0 or negative, you will get an error whenever the resulting index mixes positive and negative values. R raises the error to avoid ambiguous requests such as
dat[c(-1,1),]
which technically is trying to return the entire data frame minus the first row, but also including the first row, equivalent to rbind(dat[-1,], dat[1,]).
So if we have some function or script that subsets like your script
dataframe<- data.frame( y = INDPROgr[2:T],
INDPROgr = INDRPOgr[1:(T-1)],
HOUSTgr = HOUSTgr[1:(T-1)],
UMCSENTgr = UMCSENTgr[1:(T-1)],
UNRATEgr = UNRATEgr[1:(T-1)],
CFNAIdiff = CFNAIdiff[1:(T-1)])
R will return an error if T is 0, since T-1 = -1 means you are subsetting with 1:(-1), that is c(1, 0, -1), or if T itself is negative, for the same reason.
As such, I suggest checking whether T becomes negative or zero somewhere in your code.
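A short sketch of how the error arises and one way to guard against it. The vector x and the name n_obs are hypothetical stand-ins for the asker's series and T. seq_len() is used because, unlike 1:n, it never counts downward.

```r
x <- 1:10       # hypothetical series
n_obs <- 0      # hypothetical stand-in for a T that has become 0

bad_index <- 1:(n_obs - 1)   # counts down: 1, 0, -1
outcome <- tryCatch(x[bad_index], error = function(e) "error")
print(outcome)               # subsetting failed: positive and negative mixed

# seq_len() yields an empty index instead of counting down
safe_index <- seq_len(max(n_obs - 1, 0))
print(x[safe_index])

# If T was never assigned, it is R's built-in alias for TRUE (numerically 1),
# so 1:(T - 1) becomes c(1, 0): no error, but only the first element is kept
default_T_index <- 1:(TRUE - 1)
print(x[default_T_index])
```

Using a name other than T for the sample size avoids silently picking up the built-in TRUE.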
I am new in writing loops and I have some difficulties there. I already looked through other questions, but didn't find the answer to my specific problem.
So let's just create a random dataset, assign column names, and set the variables to character:
d<-data.frame(replicate(4,sample(1:9,197,rep=TRUE)))
colnames(d)<-c("variable1","variable2","trait1","trait2")
d$variable1<-as.character(d$variable1)
d$variable2<-as.character(d$variable2)
Now I define the vector over which I want to loop. It corresponds to trait 1 and trait 2:
trt.nm <- names(d[c(3,4)])
Now I want to apply the following model for trait 1 and trait 2 (which should now be as column names in trt.nm) in a loop:
library(lme4)
for(trait in trt.nm)
{
lmer (trait ~ 1 + variable1 + (1|variable2) ,data=d)
}
Now I get the error that variable lengths differ. How could this be explained?
If I apply the model without loop for each trait, I get a result, so the problem has to be somewhere in the loop, I think.
trait is a string, so you'll have to convert it to a formula to work; see http://www.cookbook-r.com/Formulas/Creating_a_formula_from_a_string/ for more info.
Try this (you'll have to add a print statement or save the result to actually see what it does, but this will run without errors):
for(trait in trt.nm) {
lmer(as.formula(paste(trait, " ~ 1 + variable1 + (1|variable2)")), data = d)
}
Another suggestion would be to use a list and lapply or purrr::map instead. Good luck!
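The lapply variant might look like the sketch below. It rebuilds the question's random data (with an arbitrary seed), creates one formula per trait, and fits the models only if lme4 is installed. The names formulas and fits are introduced here for illustration. With purely random data, lmer may report a singular fit, which is a message and not an error.

```r
set.seed(123)  # arbitrary seed, only for reproducibility
d <- data.frame(replicate(4, sample(1:9, 197, replace = TRUE)))
colnames(d) <- c("variable1", "variable2", "trait1", "trait2")
d$variable1 <- as.character(d$variable1)
d$variable2 <- as.character(d$variable2)
trt.nm <- names(d)[3:4]

# One formula object per trait, built from the trait name as a string
formulas <- lapply(trt.nm, function(trait) {
  as.formula(paste(trait, "~ 1 + variable1 + (1 | variable2)"))
})
names(formulas) <- trt.nm

# Fit one model per formula and keep the results in a named list
if (requireNamespace("lme4", quietly = TRUE)) {
  fits <- lapply(formulas, function(f) lme4::lmer(f, data = d))
  print(names(fits))
}
```

Keeping the fits in a named list means each model can be inspected afterwards, for example with summary(fits$trait1).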
This is my first post so I hope it is not too elementary. I am trying to match observations which have a negative Amount to counterparts that have a positive Amount and an equal abs(Amount). Furthermore, I want to check that the Amounts are both from the same Account. To do this, I am trying to use a for loop, but am getting the following error: "Operations are possibly only for numeric, logical or complex types." This is my code so far:
for(i in 1:nrow(data)){
for(j in 1:nrow(data)){
if ((data$Amount[i]=abs(data$Amount[j]))&(data$Amount[i]!=data$Amount[j])&(data$Account[i]=data$Account[j]))
{data$debit[i]<-1}}}
Does anyone have any idea why this is happening, or know of a better way using the Apply function family? Thank you in advance!
EDIT:
Below is a toy data set to illustrate this example. For instance, on this data set, I want to create an indicator variable that would be 0 except for ID=3, because for that observation 4.7=abs(-4.7) and "abc1"="abc1".
Data <- " ID Amount Account
1 5.0 abc1
2 -5.0 abc9
3 4.7 abc1
4 4.6 abc7
5 5.0 abc8
6 -4.7 abc1 "
Here's an alternative method of achieving the same result with a lot less code (and I think it's easier to read too)
library(dplyr)
Data <- Data %>%
group_by(Account) %>%
mutate(
debit = (Amount > 0 & -Amount %in% unique(Amount)) * 1
) %>%
ungroup()
If you aren't familiar with the pipe operator (%>%), it allows us to avoid nesting a lot of functions inside one another. It works by taking the output of the previous function, and entering it as the first argument of the next function. So this code takes the data set (Data), groups it by the Account, adds a new column with the indicator variable with the desired criterion, and then ungroups the data so it's back to its normal format.
The looping is done within these function calls, which allows them to be implemented in compiled languages (usually C++) - which can be a lot faster than R.
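For readers without dplyr, the same per-account check can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. This swaps ave() in for group_by/mutate. The toy data from the question is read into a data frame with read.table first, since the question shows it as a string.

```r
# Toy data from the question, parsed into a data frame
Data <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID Amount Account
1  5.0 abc1
2 -5.0 abc9
3  4.7 abc1
4  4.6 abc7
5  5.0 abc8
6 -4.7 abc1
")

# Within each Account: flag positive amounts whose negative also occurs
Data$debit <- ave(Data$Amount, Data$Account,
                  FUN = function(a) as.numeric(a > 0 & (-a) %in% a))

print(Data)
```

Only ID 3 is flagged: it is positive and account abc1 also contains -4.7.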
You need to use the == operator (= is an assignment operator) and the && rather than the & operator for your logical condition:
## Assignment (incorrect in this case!)
1 = 1
# Error in 1 = 1 : invalid (do_set) left-hand side to assignment
a <- 1
a = a
Note that with a = a no logical check is performed (it is just the equivalent of a <- a; see more here).
## Checking equivalence (returns a logical)
1 == 1
# [1] TRUE
a == a
# [1] TRUE
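For the difference between & and &&: the first compares element by element and returns a logical vector, while the second evaluates a single length-one condition (with short-circuiting), which is what if expects (see here).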
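A small sketch of that difference, using made-up values in the spirit of the toy data:

```r
amount  <- c(5.0, -5.0, 4.7)
account <- c("abc1", "abc9", "abc1")

# & works element by element and returns one logical per position
elementwise <- (amount > 0) & (account == "abc1")
print(elementwise)

# && expects single values and returns a single logical, as if() needs
single <- (amount[1] > 0) && (account[1] == "abc1")
print(single)

# && short-circuits: the right-hand side is not evaluated when the left is FALSE
short_circuit <- FALSE && stop("never evaluated")
print(short_circuit)
```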
Also, it might be more elegant to check whether the sum of data$Amount[i] and data$Amount[j] is zero rather than to check whether they have the same absolute value but not the same signed value.
## Your example
for(i in 1:nrow(data)){
for(j in 1:nrow(data)){
if ( (sum(c(data$Amount[i], data$Amount[j])) == 0) && (data$Account[i] == data$Account[j]) ) {
data$debit[i]<-1
}
}
}
I have the following function "cOrder"
library(MASS)
cOrder=function(anm,sir,dam){
maxloop=1000
i = 1
count = 0
mam=length(anm)
old = rep(1,mam)
new = old
while(i>0){
for (j in 1:mam){
ks = sir[j]
kd = dam[j]
gen = new[j]+1
if(ks != "NA"){
js = match(ks,anm)
if(gen > new[js]){new[js] = gen} #where error occurs
}
if(kd != "NA"){
jd = match(kd,anm)
if(gen > new[jd]){new[jd] = gen}
}
} # for loop
changes = sum(new - old)
old = new
i = changes
count = count + 1
if(count > maxloop){i=0}
} # while loop
return(new)
} # function loop
which works brilliantly when given the following dataset:
animal=c("bf","dd","ga","ec","fb","ag","he")
sire=c("dd","ga","NA","ga","NA","bf","dd")
dams=c("he","ec","NA","fb","NA","ec","fb")
gg=cOrder(animal,sire,dams)
but crashes and burns with the following:
animal=c("67947887","67947986","67948372","67948877","67948927","67949057","67950873","67951186","67951285","67951384","67951400","67951525","67951681","68045244","68045657","69999837","77542587","77542629","78468170","79879946")
sire=c("45334307","45334307","40684433","38121933","38141933","40684433","43339787","38431722","40684433","43339787","34931873","40684433","34931873","67951525","67951525","67950873","67951400","67951384","NA","67951681")
dams=c("37084407","25565110","36817369","21897145","21897145","20138814","32629901","37485356","25731548","32129629","31795768","37588084","36812355","68040013","68040500","68040443","67951855","67950980","67949065","67948307")
gg=cOrder(animal,sire,dams)
>Error in if (gen > new[js]) { : missing value where TRUE/FALSE needed
Both of these are input as character vectors, so I don't think it is a matter of one set having letters and the other numeric digits. Or could it be? I have also tried making them numeric, importing from a .csv, unlisting them, etc. The error stays the same.
My individual names generally consist of 8-digit numeric codes, any suggestions towards preventing this error, or renaming my whole population?
Thanks!
EDIT
The way the datasets are set up is as follows: the first animal in the vector is the offspring of the first dam and sire in their respective vectors. Thus, according to the simple set, bf is the offspring of dd and he, dd of ga and ec, and the parents of ga are unknown.
The idea behind this function is to determine the "oldest" animal/s in the dataset, i.e., the ones with the least number of generations, and eventually in succeeding code order them accordingly and generate a relationship matrix. So it is supposed to be OK if an animal does not appear in the sire list; it means that it is an older animal. So the code is supposed to move on to the next. Which it does in the simple set, but not in the proper one. Any ideas?
Thanks!
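It is because your first sire value (45334307) doesn't match anything in your animal list, so match() returns NA. Indexing with it makes new[js] NA as well, so gen > new[js] evaluates to NA, and if() cannot decide between TRUE and FALSE. The simple set works because every named parent there also appears in animal.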
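One possible way to guard against this is to skip any parent that match() cannot find. The sketch below is a rewrite under that assumption, not the asker's original function: the name cOrderSafe and the tiny second example are hypothetical. Parents that are absent from the animal vector (including the "NA" strings) are simply treated as unknown.

```r
# Hypothetical variant of cOrder that ignores parents not listed as animals
cOrderSafe <- function(anm, sir, dam, maxloop = 1000) {
  gen_of <- rep(1, length(anm))
  for (pass in seq_len(maxloop)) {
    old <- gen_of
    for (j in seq_along(anm)) {
      gen <- gen_of[j] + 1
      for (parent in c(sir[j], dam[j])) {
        k <- match(parent, anm)  # NA when the parent is not in anm
        if (!is.na(k) && gen > gen_of[k]) gen_of[k] <- gen
      }
    }
    if (identical(gen_of, old)) break  # stop once a full pass changes nothing
  }
  gen_of
}

# Simple set from the question
animal <- c("bf", "dd", "ga", "ec", "fb", "ag", "he")
sire   <- c("dd", "ga", "NA", "ga", "NA", "bf", "dd")
dams   <- c("he", "ec", "NA", "fb", "NA", "ec", "fb")
gg <- cOrderSafe(animal, sire, dams)
print(gg)

# Hypothetical tiny case where some parents are not in the animal vector
gg_unknown <- cOrderSafe(c("a", "b"), c("x", "a"), c("NA", "y"))
print(gg_unknown)
```

The !is.na(k) test comes first, so the comparison gen > gen_of[k] is only evaluated for parents that were actually found.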