Append row to dataset in R

I'm trying to create a new dataset by deleting some rows from ds2 (through a comparison with a dataset ds1). I wrote a function that should do this:
compare<-function(ds1,ds2){
for(i in 1:length(ds1$long)){
for(j in 1:length(ds2$long)){
if(ds1$long[i]<(ds2$long[j]+500) & ds1$long[i]>(ds2$long[j]-500)){
if(ds1$lat[i]<(ds2$lat[j]+500) & ds1$lat[i]>(ds2$lat[j]-500)){
ds3<-data.frame(merge(ds2[j,],ds3))
}
}
}
}
return(ds3)
}
ds3 is the dataset I want to return, it should be formed by the rows of the original dataset ds2 that satisfy the condition above.
My function gives me an error:
Error in as.data.frame(y) :
argument "y" is missing, with no default
Is "merge()" the right function for creating such a dataset, appending rows to ds3?
If not, which is the right function to do this?
Thank you all in advance
Edit: I modified the function thanks to your tips, using
ds3<-data.frame()
ds3<-rbind(ds3,ds2[j,])
instead of
ds3<-data.frame(merge(ds2[j,],ds3))
Now I've got this error:
Error in rbind(ds3, ds2[j, ]) :
no method for coercing this S4 class to a vector
If I use rbind(), can I operate with SpatialPoints? (data contained in my dataset are spatial points)
Edit2: I have 2 datasets, one with 330 rows (points on an irregular grid, ds1) and one with ~150000 rows (points on a regular grid, ds2). I want to compute the correlation between the variables in the first dataset and the variables in the second one. To do this, I want to "reduce" the second dataset to the dimensions of the first, keeping only the points which have the same (or nearly the same) coordinates in both datasets.

Without a small example this is untested, but if you are happy with the performance of the for loop, this may be what you are attempting:
compare <- function(ds1, ds2) {
  ds3 <- NULL # nothing matched yet; rbind(NULL, x) simply returns x
  for (i in seq_along(ds1$long)) {
    for (j in seq_along(ds2$long)) { # check every row of ds2 against row i of ds1
      if (ds1$long[i] < (ds2$long[j] + 500) && ds1$long[i] > (ds2$long[j] - 500)) {
        if (ds1$lat[i] < (ds2$lat[j] + 500) && ds1$lat[i] > (ds2$lat[j] - 500)) {
          ds3 <- rbind(ds3, ds2[j, ]) # append the matching row of ds2
        }
      }
    }
  }
  return(ds3)
}
The inner loop runs over all of ds2 because ds1 and ds2 are different datasets, so an i,j match does not imply a j,i match. ds3 is initialised before the loops so that the first match has something to be appended to. Again, I cannot tell for sure this is appropriate because the problem description is still not entirely clear to me.
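If the nested loop turns out to be too slow for ~150000 rows, one option is to loop only over the smaller dataset and compare each of its points against all of ds2 at once. The following is a minimal sketch, assuming ds1 and ds2 are plain data frames with numeric long and lat columns (not Spatial* objects); the function name compare_vec, the tol argument and the toy data are made up for illustration.

```r
# Keep the rows of ds2 that lie within `tol` of at least one point of ds1.
# Assumes plain data frames with numeric `long` and `lat` columns.
compare_vec <- function(ds1, ds2, tol = 500) {
  keep <- logical(nrow(ds2))            # FALSE for every row of ds2 to start with
  for (i in seq_len(nrow(ds1))) {
    # vectorised test of point i of ds1 against all rows of ds2
    near <- abs(ds2$long - ds1$long[i]) < tol &
            abs(ds2$lat  - ds1$lat[i])  < tol
    keep <- keep | near
  }
  ds2[keep, , drop = FALSE]
}

# Toy data, invented for this example
ds1 <- data.frame(long = c(1000, 5000), lat = c(1000, 5000))
ds2 <- data.frame(long  = c(900, 3000, 5400, 9000),
                  lat   = c(1200, 3000, 4800, 9000),
                  value = 1:4)
ds3 <- compare_vec(ds1, ds2)
```

Unlike the loop with rbind, this keeps each row of ds2 at most once, even if it is close to several points of ds1.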

Related

How to change data within a column in a dataset in R

I have created a for loop that goes through each row of a certain column. I want to change the information written in that cell depending on certain conditions, so I implemented an if/else statement.
However, the current problem is that the data is printing out one specific outcome: B.
I tried to combat this problem by exporting using write.csv and importing using read.csv.
When I applied the head() function though, I still got Medium for all rows.
Would anyone be able to help with this please?
Walk through the following example step by step. You need to index with the for-loop variable correctly. Could you show us the data frame where you are changing values? That would be helpful.
# create a new data frame
Df <- data.frame(a=c(1,2,3,4,5),b=c(2,3,5,6,8),c=c(10,4,2,3,7))
Df$Newcolumn <- NA_character_ # create the new column up front
for (k in seq_len(nrow(Df))) {
  # k indexes the current row; Df$Newcolumn[k] sets that row's value in the new column
  if (Df$a[k]<=3) {
    Df$Newcolumn[k] <- "low"
  } else if (Df$a[k]>3 && Df$a[k]<=6) {
    Df$Newcolumn[k] <- "medium"
  }
}
You do not need a for loop to create a new column based on conditions. You could simply use this:
cool$b <- cool$a
cool$b[cool$a < 3] <- "low"
cool$b[cool$a >= 3 & cool$a < 4] <- "Medium"
cool$b[cool$a >= 4 & cool$a < 5] <- "Rich"
cool$b[cool$a >= 5] <- "Very Rich"
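The same kind of binning can also be written with base R's cut(), which assigns each value to an interval in one call. This is only a sketch on made-up data that reuses the "low"/"medium" thresholds from the loop example (up to 3, and above 3 up to 6); adjust the breaks and labels to your own categories.

```r
# Toy data frame, invented for this example
Df <- data.frame(a = c(1, 2, 3, 4, 5))

# cut() uses right-closed intervals by default: (-Inf, 3] and (3, 6]
Df$Newcolumn <- as.character(
  cut(Df$a, breaks = c(-Inf, 3, 6), labels = c("low", "medium"))
)
```

Values outside all intervals (here, anything above 6) come back as NA, which makes unhandled cases easy to spot.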

How to create a new variable in R that returns 1 if a case has a missing value while another variable has an observed value?

I have two variables containing missing data, loon and profstat. For a better overview of the data that are missing and need to be imputed, I wanted to create an additional variable problem in the data frame that would return, for each case, 1 if loon is missing and profstat is observed, and 0 otherwise. I have generated the following code, which only gives me as output x[] = 1. Any solution to this problem?
{
problem <- dim(length(t))
for (i in 1:nrow(dflapopofficial))
{
if (is.na(dflapopofficial$loon[i])==TRUE & is.na(dflapopofficial$profstat[i])==FALSE) {
dflapopofficial$problem[i]=1
} else {
dflapopofficial$problem[i]=0
}
return(problem)
}
There are a few things that could be improved here:
Remember, many operations in R are vectorized. You don't need to loop through each element in a vector when doing logical checks etc.
is.na(some_condition) == TRUE is just the same as is.na(some_condition) and is.na(some_condition) == FALSE is the same as !is.na(some_condition)
If you want to write a new column inside a dataframe, and you are referring to several variables in that dataframe, using within can save you a lot of typing - particularly if your dataframe has a long name
You are returning problem, yet in your loop you are writing to dflapopofficial$problem, which is a different variable.
If you want to write 1s and 0s, you can implicitly convert logical to numeric using +(logical_vector)
Putting all this together, you can replace your whole loop with a single line:
within(dflapopofficial, problem <- +(is.na(loon) & !is.na(profstat)))
Remember to store the result, either back to the dataframe or to a copy of it, like
df <- within(dflapopofficial, problem <- +(is.na(loon) & !is.na(profstat)))
So that df is just a copy of dflapopofficial with your extra column.
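To see what the one-liner produces, here is a small sketch on made-up data. Only the column names loon and profstat come from the question; the four rows and their values are invented for illustration.

```r
# Toy stand-in for the real data frame (values are invented)
dflapopofficial <- data.frame(
  loon     = c(NA, 2500, NA, 3100),
  profstat = c("employed", "employed", NA, NA)
)

# problem is 1 only where loon is missing AND profstat is observed
df <- within(dflapopofficial, problem <- +(is.na(loon) & !is.na(profstat)))
```

Only the first row has loon missing while profstat is observed, so it is the only row flagged with 1.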

How do I run a for loop over all columns of a data frame and return the result as a separate data frame or matrix

I am trying to obtain the number of cases for each variable in a df. There are 275 cases in the df but most columns have some missing data. I am trying to run a for loop to obtain the information as follows:
idef_id<-readxl::read_xlsx("IDEF.xlsx")
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(i))
275-nas
}
however the output for casenums is
> summary(casenums)
Length Class Mode
0 NULL NULL
Any help would be much appreciated!
A for loop isn't a function - it doesn't return anything, so x <- for(... doesn't ever make sense. You can do that with, e.g., sapply, like this
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))
Or you can do it in a for loop, but you need to assign to a particular value inside the loop:
casenums = rep(NA, ncol(idef_id))
names(casenums) = names(idef_id)
for(i in names(idef_id)) {
casenums[i] = sum(!is.na(idef_id[[i]]))
}
You also had a problem that i is taking on column names, so sum(is.na(i)) is asking if the value of the column name is missing. You need to use idef_id[[i]] to access the actual column, not just the column name, as I show above.
You seem to want the answer to be the number of non-NA values, so I switched to sum(!is.na(...)) to count that directly, rather than hard-coding the number of rows of the data frame and doing subtraction.
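As a quick check of the idea, the sketch below runs the sapply() version on a small made-up data frame (standing in for the Excel file, which is not available here) and compares it with colSums(!is.na(...)), a base R alternative that gives the same counts.

```r
# Toy stand-in for idef_id; the columns and values are invented
idef_id <- data.frame(
  age   = c(30, NA, 45, 52),
  score = c(NA, NA, 7, 9),
  group = c("a", "b", NA, "d")
)

# number of non-missing cases per column
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))

# same counts without an explicit function: is.na() on a data frame
# returns a logical matrix, and colSums() counts the TRUEs per column
casenums_alt <- colSums(!is.na(idef_id))
```

Both results are named vectors, so casenums["score"] gives the count for a single column.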
The immediate fix for your for loop is that your i is a column name, not the data within. On your first pass through the for loop, your i is class character, always length 1, so sum(is.na(i)) is going to be 0. Due to how frames are structured, there is very little likelihood that a name is NA (though it is possible ... with manual subterfuge).
I suggest a literal fix for your code could be:
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
275-nas
}
But this has the added problem that for loops don't return anything (as Gregor's answer also discusses). For the sake of walking through things, I'll keep that (for the first bullet), and then fix it (in the second):
Two things:
hard-coding 275 (assuming that's the number of rows in the frame) will be problematic if/when your data ever changes. Even if you're "confident" it never will ... I still recommend not hard-coding it. If it's based on the number of rows, then perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
OUT_OF - nas
}
at least in a declarative sense, where the variable name (please choose something better) is clear as to how you determined 275 and how (if necessary) it should be fixed in the future.
(Or better, use Gregor's logic of sum(!is.na(...)) if you just need to count not-NA.)
doing something for each column of a frame is easily done using sapply or lapply, perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
OUT_OF - sapply(idef_id, function(one_column) sum(is.na(one_column)))
## or
sapply(idef_id, function(one_column) OUT_OF - sum(is.na(one_column)))

How to manipulate a count matrix from a DGEList?

I am currently doing an RNASeq differential expression analysis. I used the function DGEList from edgeR to obtain the count and sample objects. I now want to remove a list of genes from count. This is the code I tried (with remove the list of genes I want to remove, and gene the reference of all genes I have):
n=0
for (i in remove) {
for (j in gene) {
n=(n+1)
if (i==j) {
counts=counts[-n, ]
n=(n-1)
}
if (n==nrow(counts)) {
n=0
}
}
}
I was expecting it to work, as it does properly on a similar matrix. The code is still running, while the one working on the matrix finished a long time ago. It should remove about 16000 rows.
Do I have to manipulate it in a different way ?
If I understand correctly, you want to filter out some genes from your count matrix. In that case, instead of the loops, you could index the counts object directly. Assuming the entries in remove match some entries in rownames(counts), you could try:
counts_subset <- counts[!rownames(counts) %in% remove, ]
A similar approach should work on the table obtained by running the LRT test (result$table). That would be a better object to filter.
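A minimal sketch of this row filtering on a plain matrix is shown below. The gene names and counts are made up, and a base R matrix stands in for the counts component of the DGEList so that the example runs without edgeR.

```r
# Toy count matrix with gene names as row names (values are invented)
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                                 c("s1", "s2", "s3")))
remove <- c("geneB", "geneD")   # genes to drop

# keep the rows whose name is NOT in `remove`; drop = FALSE keeps a matrix
# even if only one row survives
keep <- !rownames(counts) %in% remove
counts_subset <- counts[keep, , drop = FALSE]
```

With a real DGEList it is usually safer to subset the whole object rather than only its counts, so that the gene annotation stays in sync; the edgeR user's guide does this with y[keep, , keep.lib.sizes = FALSE], where y is the DGEList.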

How to counter the 'non-numeric matrix extent' error in R?

I'm trying to generate a data frame of simulated values from the student's t distribution using the standard stochastic equation. The function I use is as follows:
matgen<-function(means,chi,covariancematrix)
{
cols<-ncol(means);
normals<-mvrnorm(n=500,mu=means,Sigma = covariancematrix);
invgammas<-rigamma(n=500,alpha=chi/2,beta=chi/2);
gen<-as.data.frame(matrix(data=NA,ncol=cols,nrow=500));
i<-1;
while(i<=500)
{
gen[i,]<-t(means)+normals[i,]*sqrt(invgammas[i]);
i<=i+1;
}
return(gen);
}
If it's not clear, I'm trying to create an empty data frame, that takes in values in cols number of columns and 500 rows. The values are numeric, of course, and R tells me that in the 9th row:
gen<-as.data.frame(matrix(data=NA,ncol=cols,nrow=500));
There's an error: 'non-numeric matrix extent'.
I remember using as.data.frame() to convert matrices into data frames in the past, and it worked quite smoothly. Even with numbers. I have been out of touch for a while, though, and can't seem to recollect or find online a solution to this problem. I tried is.numeric(), as.numeric(), 0s instead of NA there, but nothing works.
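As Roland pointed out, one problem is that cols doesn't seem to be numeric. Please check whether means is a data frame or matrix, e.g. with str(means): ncol() returns NULL for a plain vector, and passing ncol=NULL to matrix() is what triggers 'non-numeric matrix extent'. If means is a data frame or matrix, your code should not produce that error.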
You also have some other issues in your code. I created a simplified example and pointed out the bugs I found as comments in the code:
library(MASS)
library(LearnBayes)
means <- cbind(c(1,2,3),c(4,5,6))
chi <- 10
matgen<-function(means,chi,covariancematrix)
{
cols <- ncol(means) # if means is a dataframe or matrix, this should work
normals <- rnorm(n=20,mean=100,sd=10) # changed example for simplification
# normals<-mvrnorm(n=20,mu=means,Sigma = covariancematrix)
# input to mu of mvrnorm should be a vector, see ?mvrnorm; but this means that ncol(means) is always 1 !?
invgammas<-rigamma(n=20,a=chi/2,b=chi/2) # changed alpha= to a and beta= to b
gen<-as.data.frame(matrix(data=NA,ncol=cols,nrow=20))
i<-1
while(i<=20)
{
gen[i,]<-t(means)+normals[i]*sqrt(invgammas[i]) # changed normals[i,] to normals [i], because it is a vector
i<-i+1 # changed <= to <-
}
return(gen)
}
matgen(means,chi,covariancematrix)
I hope this helps.
P.S. You don't need ";" at the end of every line in R
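For comparison, here is a loop-free sketch of the same idea using only base R. It is not the asker's exact function: it assumes means is a numeric vector (or something as.numeric() can flatten), draws zero-mean correlated normals through a Cholesky factor instead of mvrnorm, and replaces rigamma with chi / rchisq(n, chi), which has an inverse-gamma(chi/2, chi/2) distribution. The argument n and the example inputs are made up for illustration.

```r
# ncol() of a plain vector is NULL, which is what breaks matrix(ncol = cols)
ncol(c(1, 2, 3))

matgen <- function(means, chi, covariancematrix, n = 500) {
  means <- as.numeric(means)   # works for a vector or a one-row matrix
  cols  <- length(means)       # length() is safe where ncol() is NULL
  # rows of z are draws from N(0, covariancematrix)
  z <- matrix(rnorm(n * cols), nrow = n) %*% chol(covariancematrix)
  # chi / chi-square(chi) is inverse-gamma(chi/2, chi/2) distributed
  invgammas <- chi / rchisq(n, df = chi)
  # scale row i by sqrt(invgammas[i]), then add the mean of each column
  gen <- sweep(z * sqrt(invgammas), 2, means, "+")
  as.data.frame(gen)
}

set.seed(1)
sims <- matgen(means = c(1, 2, 3), chi = 10, covariancematrix = diag(3))
```

Building the whole matrix at once avoids both the while loop and the need to pre-allocate an empty data frame.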

Resources