How do I sum an R dataframe into one cell? - r

I am trying to sum 2 dataframes named X and Y into one cell in a new dataframe named PL. The problem I am having is when I use this script :
df$PL <- sum(df$X + df$Y)
it propagates the entire PL column instead of just one cell.
How do I code it so it just fills one cell ?

The sum() function doesn't do vectorization. Just add them together. Also, you mean variables, i.e. columns, not data frames. df is, I assume, a data frame. You also mean rows, not cells, I would guess.
df$PL <- df$X + df$Y
If this is not what you want, then please share some example data along with what output you are looking for.

Related

How to create a "top ten" vector that keeps labels?

I have a data set that has 655 Rows, and 21 Columns. I'm currently looping through each column and need to find the top ten of each, but when I use the head() function, it doesn't keep the labels (they are names of bacteria, each column is a sample). Is there a way to create sorted subset of data that sorts the row name along with it?
right now I am doing
topten <- head(sort(genuscounts[,c(1,i)], decreasing = TRUE) n = 10)
but I am getting an error message since column 1 is the list of names.
Thanks!
Because sort() applies to vectors, it's not going to work with your subset genuscounts[,c(1,i)], because the subset has multiple columns. In base R, you'll want to use order():
thisColumn <- genuscounts[,c(1,i)]
topten <- head(thisColumn[order(thisColumn[,2],decreasing=T),],10)
You could also use arrange_() from the dplyr package, which provides a more user-friendly interface:
library(dplyr)
head(arrange_(genuscounts[,c(1,i)],desc(names(genuscounts)[i])),10)
You'd need to use arrange_() instead of arrange() because your column name will be a string and not an object.
Hope this helps!!

Retaining a value in an R dataset if it's present in another dataset

I am currently working on a code which applies to various datasets from an experiment which looks at a wide range of variables which might not be present in every repetition. My first step is to create an empty dataset with all the possible variables, and then write a function which retains columns that are in the dataset being inputted and delete the rest. Here is an example of how I want to achieve this:-
x<-c("a","b","c","d","e","f","g")
y<-c("c","f","g")
Is there a way of removing elements of x that aren't present in y and/or retaining values of x that are present in y?
For your first question: "My first step is to create an empty dataset with all the possible variables", I would use factor on the concatenation of all the vectors, for example:
all_vect = c(x, y)
possible = levels(factor(all_vect))
Then, for the second part " write a function which retains columns that are in the dataset being inputted and delete the rest", I would write:
df[,names(df)%in%possible]
As akrun wrote, use intersect(x,y) or
> x[x %in% y]

Changing values of multiple column elements for dataframe in R

I'm trying to update a bunch of columns by adding and subtracting SD to each value of the column. The SD is for the given column.
The below is the reproducible code that I came up with, but I feel this is not the most efficient way to do it. Could someone suggest me a better way to do this?
Essentially, there are 20 rows and 9 columns.I just need two separate dataframes one that has values for each column adjusted by adding SD of that column and the other by subtracting SD from each value of the column.
##Example
##data frame containing 9 columns and 20 rows
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
##Standard Deviation calcualted for each row and stored in an object - i don't what this objcet is -vector, list, dataframe ?
Hi_SD<-apply(Hi,2,sd)
#data frame converted to matrix to allow addition of SD to each value
Hi_Matrix<-as.matrix(Hi,rownames.force=FALSE)
#a new object created that will store values(original+1SD) for each variable
Hi_SDValues<-NULL
#variable re-created -contains sum of first column of matrix and first element of list. I have only done this for 2 columns for the purposes of this example. however, all columns would need to be recreated
Hi_SDValues$X1<-Hi_Matrix[,1]+Hi_SD[1]
Hi_SDValues$X2<-Hi_Matrix[,2]+Hi_SD[2]
#convert the object back to a dataframe
Hi_SDValues<-as.data.frame(Hi_SDValues)
##Repeat for one SD less
Hi_SDValues_Less<-NULL
Hi_SDValues_Less$X1<-Hi_Matrix[,1]-Hi_SD[1]
Hi_SDValues_Less$X2<-Hi_Matrix[,2]-Hi_SD[2]
Hi_SDValues_Less<-as.data.frame(Hi_SDValues_Less)
This is a job for sweep (type ?sweep in R for the documentation)
Hi <- data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD <- apply(Hi,2,sd)
Hi_SD_subtracted <- sweep(Hi, 2, Hi_SD)
You don't need to convert the dataframe to a matrix in order to add the SD
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD<-apply(Hi,2,sd) # Hi_SD is a named numeric vector
Hi_SDValues<-Hi # Creating a new dataframe that we will add the SDs to
# Loop through all columns (there are many ways to do this)
for (i in 1:9){
Hi_SDValues[,i]<-Hi_SDValues[,i]+Hi_SD[i]
}
# Do pretty much the same thing for the next dataframe
Hi_SDValues_Less <- Hi
for (i in 1:9){
Hi_SDValues[,i]<-Hi_SDValues[,i]-Hi_SD[i]
}

sapply() fucntion to generate summary statistics, need to remove a column before I use the function

I am trying to generate summary statistics by using sapply on 5 columns with numerical data. There is however 1 column with sex F/M (which is the second column of my dataframe), which I do not need to apply this to. I have tried removing the column by using
data_2 <- data_2[,2]
and a bunch of other methods but they do not seem to remove the column.
I have to work out the mean, sd, min, max and sample size with the sapply function.
In cases like this, I find it easier to use indices, instead of the data itself:
sapply((1:ncol(data_2))[-2], function(i) {
c(mean(data_2[,i]), sd(data_2[,i])) # add other functions
})
Use
data_2 <- data_2[, -2]
minus removes the column, without the minus you're just returning the second column.
However overwriting data_2 with data_2[, -2] is not optimal, so better just to run sapply on data_2[, -2].

Specifying names of columns to be used in a loop R

I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data generates a plot with X1 giving the x-axis values and faceted using the values in B and C.
plotdata<-function(l){
melt<-melt(df,id.vars=c("X1","B","C"),measure.vars=l)
plot<-ggplot(melt,aes(x=X1,y=value))+geom_point()
plot2<-plot+facet_grid(B ~ C)
ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=plot2)
}
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However I am new to writing for loops and can't find a way to loop in the specific column names that are required for my function to work. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop the user specified columns
Apologies if there is an answer to this is already in stackoverflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric

Resources