This might be very simple, but I can't work out how to fix this problem. I need to calculate growth for multiple columns, and when I divide by a column that contains a 0 value, the result is Inf.
Let me take an example data set:
a <- c(1,0,3,4,5)
b <- c(1,4,2,0,4)
c <- data.frame(a,b)
c$growth <- b/a-1
In the 2nd row, a is 0, so the growth is Inf. It should display 4 instead.
My original data is in data.table so any solution in data.table would help.
How can we fix this?
I don't know why you want to turn Inf into 4. In my opinion it doesn't make sense, because the growth is not 4, it is Inf. However, if you still want to do that, here's some code:
a <- c(1,0,3,4,5)
b <- c(1,4,2,0,4)
data <- data.frame(a,b)
data$growth <- b/a-1
data[data$growth == Inf,3] <- data[data$growth == Inf,2]
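As a variation on the code above, the zero check can be done before dividing, so that Inf never appears. This is only a sketch in base R: the name dat is made up for the example, and falling back to b when a is 0 is simply what the question asked for, not a general definition of growth.

```r
a <- c(1, 0, 3, 4, 5)
b <- c(1, 4, 2, 0, 4)
dat <- data.frame(a, b)

# Where a is 0, fall back to b (as requested in the question);
# otherwise compute the usual growth b / a - 1.
dat$growth <- ifelse(dat$a == 0, dat$b, dat$b / dat$a - 1)
dat

# With a data.table DT holding columns a and b, the same expression
# can be used inside := (not run here):
# DT[, growth := ifelse(a == 0, b, b / a - 1)]
```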
Related
I am using R and I have a vector, correct, whose names contain all the target names (names(correct) <- c("A","B","C","D","E")), such as:
correct <- c("a","b","c","d","e")
names(correct) <- c("A","B","C","D","E")
The vector I have to modify, tofix, has names that may be missing some of the names in correct. In the example below it is missing "C" and "E".
tofix <- c(2,5,4)
names(tofix) <- c("A","B","D")
I want to fix it so that the resulting vector, fixed, has the same names as correct, in the same order, with 0 as the value wherever a name is missing, like the below:
fixed <- c(2,5,0,4,0)
names(fixed) <- names(correct)
Any idea how to do this in R? I tried with multiple if statements and for loops, but time complexity was far from ideal.
Many thanks in advance.
You may try
fixed <- rep(0, length(correct))
fixed[match(names(tofix), names(correct))] <- tofix
names(fixed) <- names(correct)
fixed
A B C D E
2 5 0 4 0
unlist(modifyList(as.list(table(names(correct))*0), as.list(tofix)))
A B C D E
2 5 0 4 0
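Another way to get the same result, sketched here as an alternative to the two answers above, is to start from a named vector of zeros and assign by name. It assumes every name in tofix also appears in correct.

```r
correct <- c("a", "b", "c", "d", "e")
names(correct) <- c("A", "B", "C", "D", "E")
tofix <- c(2, 5, 4)
names(tofix) <- c("A", "B", "D")

# Start with a zero for every target name, in the target order.
fixed <- setNames(numeric(length(correct)), names(correct))

# Indexing by name overwrites only the entries present in tofix.
fixed[names(tofix)] <- tofix
fixed
```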
I am working with biological data at a hospital (I won't disclose anything here, but I won't need to in order to ask this question). We are looking at concentrations of antibodies taken at certain points in time. There are, for one reason or another, missing data points all over our data set. What I am trying to do is remove the missing data points along with their corresponding times. Right now the goal is just to get some basic graphs and charts up and running, but eventually we're going to want to create some logistical models and nonlinear dynamics models, which we'll do in another language.
1) First I put my data into a vector along with its corresponding time:
data <- read.csv("blablabla.csv", header = TRUE)
Biomarker <- data[,2]
time <- data[,1]
2) Then I sort the data:
Biomarker <- Biomarker[order(time)]
time <- sort(time, decreasing = F)
3) Then I put the indexes of the NA values into a vector:
NA_Index <- which(is.na(Biomarker))
4) Then I try to remove the data points at those indexes from both the biomarker and time vectors:
i <- 1
n <- length(NA_Index)
for(i:n){
Biomarker[[NA_Index[i]]] <- NULL
time[[NA_Index[i]]] <- NULL
}
Also I have tried a few different things than the one above:
1)
Biomarker <- Biomarker[-NA_Index[i]]
2)
Biomarker <- Biomarker[!= "NA"]
My question is: "How do I remove NA values from my vectors and remove the time with the same index?"
Obviously I am very new to R and might be going about this in completely the wrong way. I just ask that you explain what all the functions do if you post some code. Thanks for the help.
First I'd recommend storing your data in a data.frame instead of two vectors. Since the entries in the vectors correspond to cases, this is a more appropriate data structure.
my_table <- data.frame(time=time, Biomarker=Biomarker)
Then you can simply subset the whole data.frame. As usual, the first dimension is rows and the second is columns; leave the second dimension empty to keep all columns.
my_table <- my_table[!is.na(my_table$Biomarker), ]
> BioMarker
[1] 1 2 NA 3 NA 5
> is.na(BioMarker)
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> BioMarker[is.na(BioMarker)]
[1] NA NA
> BioMarker[! is.na(BioMarker)]
[1] 1 2 3 5
> BioMarker <- BioMarker[! is.na(BioMarker)]
> BioMarker
[1] 1 2 3 5
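To drop the matching times as well, the same logical mask can be applied to both vectors. The sketch below uses made-up values (chosen so the sorted biomarker matches the console example above); keep and ord are names introduced just for illustration.

```r
# Made-up example data: time is unsorted, Biomarker has two NAs.
time <- c(3, 1, 2, 6, 5, 4)
Biomarker <- c(NA, 1, 2, 5, NA, 3)

# order() returns the positions that sort time; applying the same
# positions to both vectors keeps each value paired with its time.
ord <- order(time)
time <- time[ord]
Biomarker <- Biomarker[ord]

# is.na() gives TRUE where a value is missing; ! flips it, so keep
# is TRUE for the entries we want to retain.
keep <- !is.na(Biomarker)
time <- time[keep]
Biomarker <- Biomarker[keep]
```

After this, time and Biomarker have the same length and no missing values remain.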
I have a dataframe with 5-second intraday data of a stock. The dataframe consists of a column for the date, one for the time and one for the price at that moment.
I want to make a new column in which it calculates the ratio of two consecutive price values.
I tried it with a for loop, which works but is really slow.
data["ratio"]<- 0
i<-2
for(i in 2:nrow(data))
{
if(is.na(data$price[i])== TRUE){
data$ratio[i] <- 0
} else {
data$ratio[i] <- ((data$price[i] / data$price[i-1]) - 1)
}
}
I was wondering if there is a faster option, since my dataset contains more than 500.000 rows.
I was already trying something with ddply:
data["ratio"]<- 0
fun <- function(x){
data$ratio <- ((data$price/lag(data$price, -1))-1)
}
ddply(data, .(data), fun)
and mutate:
data<- mutate(data, (ratio =((price/lag(price))-1)))
but neither works and I don't know how to solve it...
Hopefully somebody can help me with this!
You can use the lag function from dplyr to shift your data by one row and then take the ratio of the original data to the shifted data. This is vectorized, so you don't need a for loop, and it should be much faster. Also, the number of lag units in dplyr's lag cannot be negative, which may be causing an error when you run your code.
library(dplyr) # lag() below is dplyr::lag, not stats::lag
# Create some fake data
set.seed(5) # For reproducibility
dat = data.frame(x=rnorm(10))
dat$ratio = dat$x/lag(dat$x,1)
dat
x ratio
1 -0.84085548 NA
2 1.38435934 -1.64637013
3 -1.25549186 -0.90691183
4 0.07014277 -0.05586875
5 1.71144087 24.39939227
6 -0.60290798 -0.35228093
7 -0.47216639 0.78314834
8 -0.63537131 1.34565131
9 -0.28577363 0.44977422
10 0.13810822 -0.48327840
A for loop in R can be extremely slow. Try to avoid it if you can.
datalen=length(data$price)
data$ratio[2:datalen]=data$price[2:datalen]/data$price[1:(datalen-1)]-1
You don't need to do the is.na check: you will get NA in the result if either the numerator or the denominator is NA.
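The same idea can be written with negative indexing, which avoids computing the length by hand. This is a small sketch on a made-up price vector; price and ratio are illustrative names, and the first element is set to NA because it has no predecessor.

```r
# Made-up prices, including one missing value.
price <- c(100, 102, NA, 101, 103)

# price[-1] drops the first element (current values);
# price[-length(price)] drops the last element (previous values).
ratio <- c(NA, price[-1] / price[-length(price)] - 1)
ratio
```

Any pair that involves the NA price yields NA, so no explicit check is needed unless you want 0 there, as in the original loop.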
I am trying to use R to run a Student's t-test and a chi-squared test on large data sets. Since I am fairly new to R, my inexperience has kept my own code from working.
Both data sets have missing data and look something like this:
AA assayX activity assayY1 activity assayY2 activity
chemical 1 TRUE 0 12.2
chemical 2 TRUE 0
chemical 3 45.2 35.6
chemical 4 FALSE 0 0
AB assayX activity assayY1 activity assayY2 activity
chemical 1 TRUE FALSE TRUE
chemical 2 TRUE FALSE
chemical 3 TRUE TRUE
chemical 4 FALSE FALSE FALSE
Since it is a large data set I am trying to write code that compares assayX to all assayYs. I'm hoping to create a Student's t-test loop for the first data set, and a chi-squared loop for the second data set. I had previously been successful creating a loop for a correlation analysis, so I based my code on that idea.
x<- na.omit(mydata1[, c(assayX)])
y<- na.omit(mydata1[, c(assayY1:assayYend)])
lapply(y, function(x)t.test(y~x))
x<-na.omit(mydata2[, c(assayX)])
y<- na.omit(mydata2[, c(assayY1:assayYend)]
lapply(y, x=x, chisq.test)
Problem with the first code is:
Invalid variable y
Problem with the second code is:
x and y must have the same length
I've made small tweaks here and there and have just got different types of errors, like not enough 'y' observations and so on. I've been primarily using this site to figure out how to work with R, so I'm hoping you guys will have a clever little solution for a new guy.
After a long time and gaining experience in R, I can answer my own question. The first step is to read blanks in the data file as NA.
df1 <- read.csv("data2.csv", header=T, na.strings=c("","NA"))
Then for the Student's t-test:
df1.p= rep(NA, 418)
for (i in seq_along(df1$Assays)){
test= t.test(df1[,c(i)]~df1$assay.activity)
current.p.val= test$p.value
df1.p[i]=current.p.val
}
Then to add a Pearson's or chi-squared test (not actually appropriate for this dataset, just as an example; the code below uses Pearson's correlation):
df1.p.2= rep(NA, length(df1$Assays))
df1.r.2= rep(NA, length(df1$Assays))
for (i in seq_along(df1$Assays)){
test2= cor.test(df1$assay.activity, df1[,c(i)], method='pearson')
current.p.val2= test2$p.value
current.rval = test2$estimate
df1.p.2[i] = current.p.val2
df1.r.2[i] = current.rval
}
df2= cbind(df1$Assays, df1.p, df1.p.2, df1.r.2)
I then filtered it for only assays with 0.1 significance, but that wasn't the question here. If you want to know that, just ask a question and I'll post an answer there :)
I don't think your data is being passed correctly to the test. t.test has arguments for whether the data is paired or not (default is FALSE) and for how to handle NAs, should you want to change from the default. You should probably use those rather than omitting NAs up front. An example with NAs in the data:
set.seed(1)
y <- runif(30, 0, 1)
y.NA <- c(3,24,27)
y[y.NA] <- NA
x <- runif(30, 0, 1)
x.NA <- c(1,3,8,12,21)
x[x.NA] <- NA
t.test(x,y)
For chisq.test you can use the table function.
chisq.test(table(x,y))$p.value
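To compare one grouping column against many assay columns, the t-test can be wrapped in sapply. This is only a sketch on simulated data: the data frame assays and its column names (active, assayY1, assayY2) are hypothetical stand-ins for the real file, and it relies on t.test dropping NA values from each sample itself.

```r
set.seed(42)
# Hypothetical data: a logical grouping column and two numeric assays.
assays <- data.frame(
  active  = rep(c(TRUE, FALSE), each = 10),
  assayY1 = rnorm(20),
  assayY2 = rnorm(20, mean = 1)
)
# Introduce a few missing values, as in the real data.
assays$assayY1[c(2, 15)] <- NA
assays$assayY2[7] <- NA

# Every column except the grouping column is treated as an assay.
y_cols <- setdiff(names(assays), "active")

# One two-sample t-test per assay column; keep only the p-value.
p_values <- sapply(y_cols, function(col) {
  values <- assays[[col]]
  t.test(values[assays$active], values[!assays$active])$p.value
})
p_values
```

The result is a named numeric vector with one p-value per assay column, which is easy to filter afterwards.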
I was wondering if there is a simple way to check whether the zero values in my data are excluded from my anova.
I first changed all my zero values to NA with
BFL$logDecomposers[which(BFL$logDecomposers==0)] = NA
I'm not sure whether 'na.action=na.exclude' makes sure these values are being ignored (like I want them to be).
standard<-lm(logDecomposers~1, data=BFL) #null model
ANOVAlnDeco.lm<-lm(logDecomposers~Species_Number,data=BFL,na.action=na.exclude)
anova(standard,ANOVAlnDeco.lm)
P.S.:I've just been using R for a few weeks, and this website has been of tremendous help to me :)
You haven't given a reproducible example, but I'll make one up.
set.seed(101)
mydata <- data.frame(x=rnorm(100),y=rlnorm(100))
## add some zeros
mydata$y[1:5] <- 0
As pointed out by @Henrik you can use the subset argument to exclude these values:
nullmodel <- lm(y~1,data=mydata,subset=y>0)
fullmodel <- update(nullmodel,.~x)
It's a little confusing, but na.exclude and na.omit (the default) actually lead to the same fitted model -- the difference is in whether NA values are included when you ask for residual or predicted values. You can try it out:
mydata2 <- within(mydata,y[y==0] <- NA)
fullmodel2 <- update(fullmodel,subset=TRUE,data=mydata2)
(subset=TRUE turns off the previous subset argument, by specifying that all the data should be included).
You can compare the fits (coefficients etc.). One shortcut is to use the nobs method, which counts the number of observations used in the model:
nrow(mydata) ## 100
nobs(nullmodel) ## 95
nobs(fullmodel) ## 95
nobs(fullmodel2) ## 95
nobs(update(fullmodel,subset=TRUE)) ## 100
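The difference between na.omit and na.exclude mentioned above can be seen directly on a small simulated data set. This is a sketch with made-up data; d, fit_omit and fit_excl are names chosen for the example.

```r
set.seed(101)
d <- data.frame(x = rnorm(20), y = rlnorm(20))
d$y[1:3] <- NA  # three missing responses

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

# Both fits use the same 17 complete rows, so the coefficients agree.
all.equal(coef(fit_omit), coef(fit_excl))
nobs(fit_omit)
nobs(fit_excl)

# The difference shows up in residuals(): na.exclude pads the result
# with NA so it lines up with the rows of the original data.
length(residuals(fit_omit))
length(residuals(fit_excl))
```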