Sum values in Rows - r

Here is my question.
I have a dataframe with 30 rows (corresponding to 30 questions in a questionnaire) with values from 1 to 5 as answers.
I would like to sum all values equal to 1 that appear in the 30 rows.
I tried the aggregate command, but it doesn't work.

The question could use more clarity, and code would help, but I will give you a theoretical example of what I believe you are asking for.
If you have a data frame df such that:
questions ob1 ob2 ob3
q1 5 3 1
q2 2 1 1
q3 4 1 5
and you want to add up all the values that are equal to 1, you have a number of options, but the most obvious is simply to subset with a logical condition, optionally wrapped in which():
sumob1 <- sum(df$ob1[which(df$ob1 == 1)])
Note that there is no comma inside the [] here: df$ob1 is a vector, so it takes a single index. A comma is only needed when you index the data frame itself, as in df[df$ob1 == 1, "ob1"], where the rows go on the left side of the comma and the columns on the right.
This basically says: make sumob1 equal to the sum of the cells in column ob1 that have a value of 1.
You can do that for each column.
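If what you are after is a total over every answer column rather than one column at a time, here is a minimal sketch. It assumes the hypothetical df shown above (a questions column followed by the answer columns ob1 to ob3). Because the values being summed are all 1, summing them is the same as counting them.

```r
# Hypothetical data, matching the example table above
df <- data.frame(
  questions = c("q1", "q2", "q3"),
  ob1 = c(5, 2, 4),
  ob2 = c(3, 1, 1),
  ob3 = c(1, 1, 5)
)

answers <- df[, -1]                # drop the questions column

total   <- sum(answers == 1)       # number of 1s in the whole table
per_col <- colSums(answers == 1)   # number of 1s in each column
per_row <- rowSums(answers == 1)   # number of 1s for each question
```

answers == 1 produces a logical matrix, and sum(), colSums() and rowSums() treat TRUE as 1, so no explicit subsetting is needed.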

Removing/collapsing duplicate rows in R

I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). Seems to work great for what I hope to do (which is remove/collapse duplicates from a dataset), but I do not understand the last line. I would like to know on what basis the duplicates are removed/collapsed. It was commented it was based on the median absolute deviation (MAD), but I am not following that. Could anyone help me understand this, please?
Probesets=paste("a",1:200,sep="")
Genes=sample(letters,200,replace=T)
Value=rnorm(200)
X=data.frame(Probesets,Genes,Value)
X=X[order(X$Value,decreasing=T),]
Y=X[which(!duplicated(X$Genes)),]
Are you sure you want to remove those rows where the Genes values are duplicated? That's at least what this code does:
Y=X[which(!duplicated(X$Genes)),]
Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)) you will see that the result is the same:
nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26
If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:
Y=X[!duplicated(X),]
To see how it works consider this example:
df <- data.frame(
a = c(1,1,2,3),
b = c(1,1,3,4)
)
df
a b
1 1 1
2 1 1
3 2 3
4 3 4
df[!duplicated(df),]
a b
1 1 1
3 2 3
4 3 4
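Your code is keeping the record with the maximum Value for each gene. X is first sorted by Value in decreasing order, and duplicated() flags every occurrence of a gene after the first one, so the only row that survives for each gene is the one with the highest Value.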
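One way to convince yourself of this is to compute the per-gene maximum independently and compare it with Y. The sketch below reuses the code from the question; the set.seed() call and the max_per_gene name are my own additions, there only to make the run reproducible and the check readable.

```r
set.seed(1)  # assumption: added only so the random data is reproducible

Probesets <- paste("a", 1:200, sep = "")
Genes     <- sample(letters, 200, replace = TRUE)
Value     <- rnorm(200)
X <- data.frame(Probesets, Genes, Value)

# Sort so the largest Value of each gene comes first, then keep first occurrences
X <- X[order(X$Value, decreasing = TRUE), ]
Y <- X[!duplicated(X$Genes), ]

# Independent check: maximum Value per gene, looked up in the row order of Y
max_per_gene <- tapply(X$Value, X$Genes, max)
same <- all(Y$Value == unname(max_per_gene[as.character(Y$Genes)]))
same
```

If same is TRUE, every row kept in Y carries the largest Value observed for its gene.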

Using list of row numbers as criteria to populate field

I have a list of row numbers that represent rows containing outliers in a data set. I would like to add an "outlier" column to the original data set that flags the rows containing outliers, but I can't figure out how to use row numbers as criteria in R.
Example:
I have a dataframe like this:
id <-c("a","b","c","d")
values <-c(10,11,22,33)
df<-data.frame(id,values)
id values
1 a 10
2 b 11
3 c 22
4 d 33
And a list like this containing row number (more correctly "row names"):
outliers <-c(2,4)
I'd like to find a way to use the list of row numbers as criteria in something like:
df$outlier_test<-ifelse( if row number is on my list, "outlier","")
to produce something like this:
id values outlier_test
1 a 10
2 b 11 outlier
3 c 22
4 d 33 outlier
Spent quite a while trying to puzzle this out and had inspiration as soon as I posted the question. For anyone else who comes here with this question:
First:
df$rownumber<- row.names(df)
then:
df$outlier_test<- ifelse(df$rownumber %in% outliers,"outlier","")
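The same flag can probably be produced without the helper column. This is a minimal sketch that assumes outliers holds row positions, which match the row names only while the data frame still has its default row names 1, 2, 3, and so on.

```r
id     <- c("a", "b", "c", "d")
values <- c(10, 11, 22, 33)
df     <- data.frame(id, values)

outliers <- c(2, 4)

# seq_len(nrow(df)) is 1, 2, 3, 4; %in% tests each position against the list
df$outlier_test <- ifelse(seq_len(nrow(df)) %in% outliers, "outlier", "")
df
```

If the rows have been filtered or reordered so that row names no longer equal positions, comparing row.names(df) against the list, as in the answer above, is the safer choice.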

How to select rows based on median value in R?

I am quite new to R and cannot figure this one out. Let’s say I have a data frame with four columns. The first column determines group membership, the second column should be used for filtering and the two last columns should just follow along. It will look like below:
> test.data
group filter a b
first 1 1 2
first 2 3 1
first 3 2 3
second 1 2 1
second 2 2 5
second 3 3 1
second 4 3 1
For each group, I would like to calculate the median in the filter column. The rows that produce that median should then also be used for columns a and b: when the number of rows is even, calculate the mean of the two middle rows, and when it is odd, just return the one middle row.
The result should be:
group filter a b
first 2 3 1
second 2.5 2.5 3
When using dplyr, I can calculate the median of each column independently of the filter column, but not with regard to the filter column:
median.data <- test.data %>% group_by(group) %>% summarise_all(funs(median))
> median.data
group filter a b
first 2.0 2.0 2
second 2.5 2.5 1
When using tapply, I can calculate the median, but don't know how to also take the other columns into account:
median.data <- tapply(test.data$filter, test.data$group, median)
> median.data
first second
2.0 2.5
Then I figured that I should try to write a function myself that performs the steps below.
for each group:
order by column "filter"
extract middle row, two rows if even
calculate mean
But then I got stuck on how to find the middle (or two middle) rows...
Do you have any suggestions on how to solve it? Any help would be greatly appreciated!
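The steps listed above can be sketched in base R roughly as follows. This is one possible approach rather than a definitive one; the helper name middle_rows is hypothetical, and the data frame is rebuilt from the table shown in the question.

```r
test.data <- data.frame(
  group  = c("first", "first", "first", "second", "second", "second", "second"),
  filter = c(1, 2, 3, 1, 2, 3, 4),
  a      = c(1, 3, 2, 2, 2, 3, 3),
  b      = c(2, 1, 3, 1, 5, 1, 1)
)

# Hypothetical helper: average the middle row (odd n) or two middle rows (even n)
middle_rows <- function(d) {
  d   <- d[order(d$filter), ]              # order by the filter column
  n   <- nrow(d)
  idx <- unique(c(floor((n + 1) / 2),      # lower middle position
                  ceiling((n + 1) / 2)))   # upper middle position (same if n is odd)
  colMeans(d[idx, c("filter", "a", "b"), drop = FALSE])
}

# Apply the helper to each group and stack the results into a matrix
result <- do.call(rbind, lapply(split(test.data, test.data$group), middle_rows))
result
```

For the example data this gives 2, 3, 1 for group first and 2.5, 2.5, 3 for group second, which matches the desired result in the question. The group names end up as the row names of result.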

R select multiple rows by conditional row number

I have a R dataframe like this one:
a<-c(1,2,3,4,5)
b<-c(6,7,8,9,10)
df<-data.frame(a,b)
colnames(df)<-c("a","b")
df
a b
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I would like to get the 1st, 2nd, 3rd AND 5th row of the column a, so 1 2 3 5, by selecting rows by their number.
I have tried df$a[1:3,5] but I get Error in df$a[1:3, 5] : incorrect number of dimensions.
What DOES work is c(df$a[1:3],df$a[5]) but I was wondering if there was an easier way to achieve this with R?
Your data frame has two dimensions (rows and columns). When you use the square brackets to extract values, R expects everything prior to the comma to indicate the rows desired, and everything after the comma to indicate the columns desired (see ?"["). Hence, df[1:3,5] means rows 1 through 3, from column 5. To turn your desired rows into a single vector, you need to concatenate (i.e., c(1:3,5)). That all goes before the comma; the column indicator, 1 or "a", goes after the comma. Thus, df[c(1:3,5), 1] is what you need.
As an alternative (which might be more appropriate for a dataframe with many more columns), df[c(1:3, 5), "a"], as suggested by @Mamoun Benghezal, would also get it done!
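For completeness, here is a small sketch of the variants side by side, using the data frame from the question. Since df$a is already a vector, it can also be indexed directly with a single concatenated index and no comma.

```r
a  <- c(1, 2, 3, 4, 5)
b  <- c(6, 7, 8, 9, 10)
df <- data.frame(a, b)

by_position <- df[c(1:3, 5), 1]    # rows before the comma, column number after
by_name     <- df[c(1:3, 5), "a"]  # same rows, column selected by name
from_vector <- df$a[c(1:3, 5)]     # df$a is a vector: one index, no comma
by_dropping <- df[-4, "a"]         # negative index drops row 4 instead
```

All four return the values 1 2 3 5.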

Reordering a paired variable

I was wondering about the following thing:
I have a 16x2 matrix with numerical values in the first column and numerical values in the second column as well, but the latter are actually position numbers, so they need to be treated as a factor.
I want to order the values from the first column from low to high but I need the numbers of the second column to stay with their original partner value from the first column.
So let's say you've got:
4 1
6 2
2 3
And now I want to sort the first column from low to high.
Then I want to get
2 3
4 1
6 2
Does anybody know how I can do this?
R doesn't seem to provide a variable type for paired data...
You can do:
dat[order(dat[, 1]), ]
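As a worked example, assuming the three rows from the question are stored in a matrix called dat:

```r
# Column 1: values to sort by; column 2: the paired position numbers
dat <- matrix(c(4, 6, 2,
                1, 2, 3), ncol = 2)

# order() returns the row permutation that sorts column 1 (here 3, 1, 2);
# using it as the row index moves whole rows, so the pairs stay together
sorted <- dat[order(dat[, 1]), ]
sorted
```

The rows of sorted are (2, 3), (4, 1) and (6, 2), which is the result asked for.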
