I am very new to coding.
I'm looking to retrieve the frequency values of rows in my data frame.
I already know you can do this using:
df$col <- rowSums( data[,1:100] )
But I specifically want the sum of data from rows that are divisible by two, in other words, even rows up to a specific point in my data frame.
Perhaps you would need to incorporate an if else function?
Something vaguely similar to this oversimplified code?
if df$col[0:5]%%2
print rowSum
else:
don't
Anyone have any ideas?
Much appreciated
Indexing with a logical vector, combined with R's recycling rule, gives a neat solution. The pattern c(FALSE, TRUE) is recycled down the rows, so only the even rows are kept:
rowSums(cars[c(FALSE, TRUE), ])
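To stop at a specific row rather than recycling over the whole data frame, one option is to build the even indices explicitly with seq(). This is only a sketch: the data frame df and the cut-off n below are made up for illustration and are not from the question.

```r
# Hypothetical data: 10 rows, 3 numeric columns
df <- data.frame(a = 1:10, b = 11:20, c = 21:30)

# Hypothetical cut-off: only consider rows up to this one
n <- 6

# Even row numbers up to n: 2, 4, 6
even_idx <- seq(2, n, by = 2)

# Row sums for just those rows
even_sums <- rowSums(df[even_idx, ])
```

The result is a numeric vector with one sum per selected row, named after the row numbers.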
st.mat <- matrix(1:100, ncol = 10, nrow = 10)
for (i in seq_len(nrow(st.mat))) {
  if (i %% 2 == 0) {
    print(sum(st.mat[i, ]))
  }
}
This would be a very simple way to do it.
I want to initialise a column in a data.frame like so:
df$newCol = 1
where df is a data.frame that I have defined earlier and already done some processing on. As long as nrow(df)>0, this isn't a problem, but sometimes my data.frame has row length 0 and I get:
> df$newCol = 1
Error in `[[<-`(`*tmp*`, name, value = 1) :
1 elements in value to replace 0 elements
I can work around this by changing my original line to
df$newCol = rep(1,nrow(df))
but this seems a bit clumsy and may be computationally expensive if the number of rows in df is large. Is there a built-in or standard solution to this problem? Or should I use some custom function like so:
addCol = function(df,name,value) {
  if(nrow(df)==0){
    df[,name] = rep(value,0)
  }else{
    df[,name] = value
  }
  df
}
If I understand correctly,
df = mtcars[0, ]
df$newCol = numeric(nrow(df))
should be it?
This assumes that by "row length" you mean nrow(df), in which case you need to append a vector of length 0. In that case, numeric(nrow(df)) will give you exactly the same result as rep(0, nrow(df)).
It also assumes that you just need a new column, not specifically a column of ones. If you do need ones, simply add 1 to the result (numeric(nrow(df)) + 1), which is a vectorized operation and therefore fast.
Other than that, I'm not sure you can have an "empty" column - the vector should have the same number of elements as the other vectors in the data frame. But numeric is fast, it should not hurt.
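As a minimal sketch of the same idea, a small wrapper can build the column with rep(value, nrow(df)), which yields a zero-length vector for an empty data frame. The function name add_const_col is hypothetical and is not part of base R or of the question.

```r
# Hypothetical helper (name invented for illustration): add a constant
# column that also works when the data frame has zero rows.
add_const_col <- function(df, name, value) {
  # rep() returns a zero-length vector when nrow(df) is 0,
  # so the assignment never has a length mismatch
  df[[name]] <- rep(value, nrow(df))
  df
}

empty <- add_const_col(mtcars[0, ], "newCol", 1)   # zero rows, column added
full  <- add_const_col(mtcars[1:3, ], "newCol", 1) # three rows, all ones
```

Both calls go through the same code path, so no special case for empty input is needed.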
I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id= c(1,2,3,4,5),
key=c(1,2,3,4,5),
num=c(1,1,1,1,1),
v4= c(1,5,5,5,7),
v5=c(1,5,5,5,7))
My real dataset is bigger and contains mostly numerical variables along with some character variables, and I couldn't determine the best way to go about this. I've previously used a program that would do something similar within its duplicates command, using an option called check.all.
So far, my thoughts have been to use grepl to determine where "anything" is present:
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resulting matrix, I take the rowSums and cbind them to the original.
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps. I have a variable which tells me how many columns are filled in each row (CompleteNess), but I'm unsure how to use it when removing duplicates.
Put simply: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
#Order by the degree of completeness
Original<-Original[order(CompleteNess),]
#Starting from the bottom select the not duplicated rows
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]
This does rearrange your original data frame so beware if there is additional processing later on.
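For reference, here is a self-contained sketch of that approach on the example data. One assumption to flag: completeness is counted with !is.na() instead of the grepl() pattern from the question. The two agree on this example, but they differ for empty strings in character columns, which is.na() treats as filled.

```r
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))

# Number of non-missing cells per row (assumption: NA is the only "empty" value)
CompleteNess <- rowSums(!is.na(Original))

# Sort so the most complete rows come last
ordered <- Original[order(CompleteNess), ]

# Scanning from the bottom, keep the first row seen for each id/key/num combination
result <- ordered[!duplicated(ordered[, c("id", "key", "num")], fromLast = TRUE), ]
```

Working on a copy (ordered) leaves Original in its initial row order, which avoids the rearrangement caveat mentioned above.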
You can aggregate your data and select the row with max score:
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original,.(id.key.num),summarize,
Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original,.(id.key.num),summarize,
Max = max(present),
v4 = v4[which.max(present)],
v5 = v5[which.max(present)]
)
How could I calculate the rowMeans of a data.frame based on matching column names?
Example:
c1=rnorm(10)
c2=rnorm(10)
c3=rnorm(10)
out=cbind(c1,c2,c3)
out=cbind(out,out)
I realize that the values are the same, this is just for demonstration.
Each row is a specific measurement type (consider it a factor).
Imagine c1 = compound 1, c2 = compound 2, etc.
I want to group together all the c1's and average their rows together, then repeat for all unique(colnames(out)).
My idea was something like:
avg = rowMeans(out,by=(unique(colnames(out)))
but obviously this doesn't work...
Try this:
sapply(unique(colnames(out)), function(i)
  rowMeans(out[, colnames(out) == i, drop = FALSE]))
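To see what this returns, here is a small deterministic sketch. The matrix m and its values are made up for illustration; drop = FALSE is included so that a name appearing only once still yields a one-column matrix for rowMeans().

```r
# Hypothetical 2-row matrix with duplicated column names
m <- cbind(c1 = c(1, 2), c2 = c(3, 4), c1 = c(5, 6), c2 = c(7, 8))

# One column of row means per unique column name
avg <- sapply(unique(colnames(m)), function(nm) {
  rowMeans(m[, colnames(m) == nm, drop = FALSE])
})

avg[, "c1"]  # means of the two c1 columns: 3 and 4
avg[, "c2"]  # means of the two c2 columns: 5 and 6
```

The result is a matrix with one row per measurement and one column per unique name.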
As @Laterow points out in the comments, having duplicate column names will lead to trouble at some point; if not here, elsewhere in your code. Best to nip it in the bud now.
If you are starting with duplicate column names, use make.unique on the colnames first to append .n where n increments for each duplicate starting at .1 for the first duplicate, leaving the initial unique names as is:
colnames(out) <- make.unique(colnames(out));
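As a quick illustration of the renaming, here is make.unique() applied to a name vector shaped like the one in the question. The vector nms is written out by hand for this sketch.

```r
# Column names as they would look after out = cbind(out, out)
nms <- c("c1", "c2", "c3", "c1", "c2", "c3")

# First occurrences are kept as is; repeats get .1, .2, ... appended
unique_nms <- make.unique(nms)
unique_nms
# "c1" "c2" "c3" "c1.1" "c2.1" "c3.1"
```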
Once that's done (or, as OP explained in the comments, if it was already being done silently by the column-creating function), you can do your rowMeans operation with dplyr's select and its starts_with helper to group columns by prefix. Since out is a matrix, convert it to a data frame first:
library(dplyr)
avg_c1 <- rowMeans(select(as.data.frame(out), starts_with("c1")))
If you have a large number of columns, instead of specifying them individually, you can use the code below to have it create a data frame of the rowMeans regardless of input size:
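# Number of copies per variable, read from the suffix of the last column name (e.g. "c3.1" gives 2)
case_count <- as.integer(sub('^c\\d+\\.(\\d+)$', '\\1', colnames(out)[ncol(out)])) + 1L
var_count <- as.integer(ncol(out) %/% case_count)
# One row per variable, one column per measurement
avg_c <- as.data.frame(matrix(nrow = var_count, ncol = nrow(out)))
for (i in seq_len(var_count)) {
  avg_c[i, 1:nrow(out)] <- rowMeans(select(as.data.frame(out), starts_with(paste0("c", i))))
}
As @Tensibai points out in the comments, this solution may not be efficient, and may be overkill depending on your actual data set. You may not need the flexibility it provides, and there's probably a more succinct way to do it. Note also that starts_with("c1") would match names such as c10, so the prefix matching only works as intended with fewer than ten variables.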
EDIT1: Based on OP comments
EDIT2: Based on comments, handle all rowMeans at once
EDIT3: Fixed code bugs and clarified starting point reasoning based on comments
I think my question is very simple.
dat1<-seq(1:100)
dat2<-seq(1:100)
How can I combine dat1 and dat2 and make it look like
dat3<-seq(1:200)
Thanks so much!
How do you want to combine dat1 and dat2? By rows or columns? I'd take a look at the help pages for rbind() (row bind), cbind() (column bind), or c(), which combines its arguments to form a vector.
Let me start with a comment.
In order to create a sequence of numbers, one can use the following syntax:
x <- seq(from=, to=, by=)
A shorthand for, e.g., x <- seq(from=1, to=10, by=1) is simply 1:10. So, your notation is a little bit weird...
On the other hand, you can combine two or more vectors using the c() function. Let us say, for example, that a <- c(1, 2) and b <- c(3, 4). Then c <- c(a, b) is the vector (1, 2, 3, 4).
There exist similar functions to combine data sets: rbind() and cbind().
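A short sketch of the three options on the vectors from the question. The only liberty taken is writing 1:100 instead of seq(1:100), which produces the same values. Note that c(dat1, dat2) has length 200 but repeats 1 to 100 twice; it is not identical to 1:200.

```r
dat1 <- 1:100
dat2 <- 1:100

# One long vector: the values of dat1 followed by the values of dat2
combined <- c(dat1, dat2)
length(combined)  # 200

# Matrices: stacked as two rows, or side by side as two columns
by_rows <- rbind(dat1, dat2)  # 2 x 100
by_cols <- cbind(dat1, dat2)  # 100 x 2
```

If the goal really is the sequence 1 to 200, then c(dat1, dat2 + 100L) would give that, under the assumption that the second vector is meant to continue the first.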