Retrieve corresponding rows as numeric values when startsWith == TRUE - r

I am trying to write a script that fully automates the processing of my data.
Here, I am trying to implement a loop that searches for all the values starting with "Blank" in the Name column of my data.frame.
How can I print all the corresponding rows as a vector?
For example, if the value "Blank C" is in the Name column at row 5, I want a vector with the values from that same row (5) in the Name column and in data columns 3:6. I want the actual values; at the moment the output is all NA.
for (val in ms.data$Name) {
if(startsWith(val,"Blank")) {
print(list(ms.data[val,3:6]))
}
}
Example data
Blank 1 05-Apr-17 7:04 PM 5.771899218 4.922906441 219.0199184 7.779938257
Blank 2 05-Apr-17 7:15 PM 4.913695034 4.071889653 2.161167065 2.567102283
Thank you
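A minimal sketch of one possible approach, assuming Name is column 1, the date/time is column 2, and the four measurements are columns 3:6. The column names Date, V1 to V4 and the "Sample A" row are made up for illustration. The loop above returns NA because ms.data[val, 3:6] looks for a row whose row name is "Blank 1", while the row names are "1", "2", and so on. Indexing with a logical vector avoids that:

```r
# Hypothetical stand-in for ms.data; only the two "Blank" rows come from the question
ms.data <- data.frame(
  Name = c("Blank 1", "Sample A", "Blank 2"),
  Date = c("05-Apr-17 7:04 PM", "05-Apr-17 7:10 PM", "05-Apr-17 7:15 PM"),
  V1 = c(5.771899218, 1.5, 4.913695034),
  V2 = c(4.922906441, 2.5, 4.071889653),
  V3 = c(219.0199184, 3.5, 2.161167065),
  V4 = c(7.779938257, 4.5, 2.567102283),
  stringsAsFactors = FALSE
)

# startsWith() is vectorised, so no loop is needed;
# as.character() guards against Name being a factor
is_blank <- startsWith(as.character(ms.data$Name), "Blank")

# Keep the Name column plus data columns 3:6 for the matching rows
blank_rows <- ms.data[is_blank, c(1, 3:6)]
print(blank_rows)

# Numeric values of the first matching row as a plain vector
first_blank_values <- unlist(blank_rows[1, -1])
```

If the loop has to stay, indexing with which(ms.data$Name == val) instead of val itself should have the same effect, since that yields row positions rather than row names.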

Related

R: Merge rows within one data frame with different constellations of data and NAs

I have a dataframe like this:
df <- data.frame(ID=c(1,2,3,3,4,4,5),
Name=c("Name1","Name2","Name3", "Name3","Name4", "Name4", "Name5"),
Price2012=c(343,767,NA, 43,NA,330, 646),
Price2013=c(423,763,35, 0,304,350, 636))
In the third and fourth rows, as well as in the fifth and sixth rows, the ID and the Name are identical. However, the values in the two price columns differ. For ID 3, one row has NA for Price2012 and 35 for Price2013, while the other has 43 for Price2012 and 0 for Price2013. For ID 4, there are also two rows, but with one NA and no zero.
I want an output like the one below. Rows with the same ID and Name should be merged into one row when one of them has NA in Price2012 and the other has a zero in Price2013. Otherwise, the NA in Price2012 should be replaced by zero only if all other cells with the same ID and Name contain values.
df_wish <- data.frame(ID=c(1,2,3,4,4,5),
Name=c("Name1","Name2", "Name3","Name4", "Name4", "Name5"),
Price2012=c(343,767,43,0,330, 646),
Price2013=c(423,763,35,304,350, 636))
I tried to use the aggregate function and I tried to get a solution via subsets, but both resulted in errors.

How do I aggregate results to get the most common number in each row of a dataset in R Studio

I'm doing some consensus clustering, and it returns a set called "consensus_imouted" of 3000 rows, each with ten repetitions of the cluster number (ranging from 1 to 6). I want to return just one column containing the most common cluster number for each row. For example, the first row is 3 3 3 3 3 3 3 3 6 3, so I would want it to be 3, and so on. Any help?
You can use the apply function as follows:
sampledata <- matrix(sample(1:6,30000,replace = TRUE), ncol = 10, nrow = 3000)
sampledata <- data.frame(sampledata)
sampledata$mostCounts <- apply(sampledata,1, function(row0) {
as.numeric(names(which.max(table(row0))))
})
To get the most frequent value, count the values in the row with table. Then choose the value with the highest count using which.max. In a table, the values corresponding to the counts are the names of the table, so use names to extract the original value. Since you know it is a number, convert the character string to a numeric value using as.numeric.
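As a small worked illustration of those steps, here is the example row from the question taken through table, which.max, names and as.numeric one step at a time. The variable names are only illustrative. Note that which.max returns the first maximum, so a tie resolves to the smallest value.

```r
# The example row from the question
row0 <- c(3, 3, 3, 3, 3, 3, 3, 3, 6, 3)

# Step 1: count how often each value occurs
counts <- table(row0)

# Step 2: position of the highest count (first one wins on ties)
top <- which.max(counts)

# Step 3: the value itself is stored as the name of that table entry
top_label <- names(top)

# Step 4: convert the character label back to a number
most_common <- as.numeric(top_label)
print(most_common)
```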

Extracting rows from data frame based on another data frame

I'm trying to extract a set of genes (row names) from my large data set based on another data matrix that contains a list of my genes of interest. I've read that I should use filter and %in%, but I am unsure how to write it properly.
example:
my large database:
Gene  Week1  Week2  Week3
A     20     14     5
B     5      10     15
C     2      4      6
D     20     18     19
my small data base:
Gene
A
C
D
And I want my result to be:
Gene  Week1  Week2  Week3
A     20     14     5
C     2      4      6
D     20     18     19
Could anybody please help out? I'd really appreciate it and my apologies for the rather simple question :)
Using logical row indexes:
large_database[large_database$Gene %in% unique(small_data_base$Gene), ]
Explanation:
large_database$Gene %in% unique(small_data_base$Gene)
This checks, for each entry (i.e. row) in large_database$Gene, whether it appears in unique(small_data_base$Gene), i.e. the list of unique values in the Gene column of small_data_base, and returns a logical vector (a vector of TRUE and FALSE).
We can then use this vector as a row 'index' to select only the rows where the vector is TRUE (i.e. where the value of large_database$Gene was found in unique(small_data_base$Gene)).
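The same idea as a runnable sketch, using the two small tables from the question. The column names Week1, Week2 and Week3 are assumed spellings of the headers shown there. The call to unique() is optional here, because %in% only tests membership.

```r
# Data from the question, rebuilt as data frames
large_database <- data.frame(
  Gene  = c("A", "B", "C", "D"),
  Week1 = c(20, 5, 2, 20),
  Week2 = c(14, 10, 4, 18),
  Week3 = c(5, 15, 6, 19),
  stringsAsFactors = FALSE
)
small_data_base <- data.frame(Gene = c("A", "C", "D"), stringsAsFactors = FALSE)

# Logical vector: TRUE where the gene is in the list of interest
keep <- large_database$Gene %in% small_data_base$Gene

# Use it as a row index; the empty slot after the comma keeps all columns
result <- large_database[keep, ]
print(result)
```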

Identify duplicated rows based on multiple columns and specific value in another column in very large matrix with for loop

I have a large matrix called data of 10,864 rows and 134 columns.
The first 4 columns are parameters which make every row unique. The data from 5th to 134th column for all rows are numbers between 1 and 20.
I am running a for loop in the matrix to insert NA into certain cells of the matrix. This needs to be done on the basis of unique values from Columns OrgID, rank and scorei only if value in same row for column score(i+12) != 1.
Hence, I run a for loop from column 5 to 134, and where there is duplication based on these three columns and the value in the score(i+12) column is not equal to 1, I insert NA into that cell of the matrix.
for(i in 5:ncol(data)) {
data[which(duplicated(data[,c(1,4,i)]) & (data[,i+12])!=1),i] <- "NA"
}
This code, however, gives the wrong output by inserting NA only where there is duplicated value on the basis of 1st,4th and ith column i.e. equivalent result to running the following code:
for(i in 5:ncol(data)) {
data[which(duplicated(data[,c(1,4,i)])),i] <- "NA"
}
How do I make it perform the required operation only when the value in column score(i+12) != 1 in the duplicated rows?
To make the failed output easier to see, I have highlighted a few rows and the relevant columns to show how this works when applied to column 118, i.e. i = 118 here.
For example, based on the logic explained above, there is duplication for OrgID = 5659. The duplication based on OrgID, rank and score118 identifies two rows, one with score130 = 1 and the other with score130 = 16. Hence, the row with score130 = 16 should now be NA according to the logic. But it remains unchanged at 16.
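Maybe you can try the following. The comma-style indexing works whether data is a matrix or a data frame.
for(i in 5:(ncol(data) - 12)) {
# flag every member of a duplicated group, not just the later copies
key <- data[, c(1, 4, i)]
inds <- duplicated(key) | duplicated(key, fromLast = TRUE)
# which() drops any NA in the condition before assigning
data[which(inds & data[, i + 12] != 1), i + 12] <- NA
}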
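A toy sketch of why the fromLast part may matter, using a hypothetical three-row data frame whose column names score_i and score_i12 stand in for scorei and score(i+12). duplicated() on its own marks only the later copies in a group, so when the row to be blanked comes first it is never flagged:

```r
# Hypothetical data: rows 1 and 2 share OrgID, rank and score_i
toy <- data.frame(
  OrgID     = c(5659, 5659, 7001),
  rank      = c(1, 1, 2),
  score_i   = c(4, 4, 9),
  score_i12 = c(16, 1, 3)
)
key <- toy[, c("OrgID", "rank", "score_i")]

# Only the second occurrence is marked: FALSE TRUE FALSE
later_copies <- duplicated(key)

# Both members of the group are marked: TRUE TRUE FALSE
all_members <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Blank score_i12 where the row is in a duplicated group and the value is not 1
toy[which(all_members & toy$score_i12 != 1), "score_i12"] <- NA
print(toy)
```

With later_copies alone, the only flagged row is the one whose score_i12 is 1, so nothing would be changed in this toy case.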

How to print a certain number of rows from a read.table in R Studio

da=read.table("m-ibm6708.txt",header=T) #<== Load data with header
#<== Check dimension of the data (row = sample size, col = variables)
###############502 rows, col 1 = date, col 2 = ibm, col 3 = sprtn
#<== Print out the first 6 rows of the data object "da".
printrows <- da[1:6]
printrows
The print rows didn't work. I tried a bunch of things. I think it might use cbind. da is a big table, but I only need the first 6 rows displayed.
As jdharrison said, head(da,6) will work - alternatively you can use the indices to print the top 6 rows da[1:6,]
When using the indices notation remember it's [rows,columns] and you must include the comma if you have a data.frame or matrix - i.e. [1:6,] for the first six rows or [,1:6] for the first six columns.
For a more advanced selection, you can combine ranges and single indices:
table_name[c(2:6, 9, 120), ]
Note: this selects rows 2 to 6, then row 9, then row 120, so you can pick a range as well as individual rows (or columns).
To do the same for columns, put the vector after the comma, as in table_name[, c(2:6, 9, 120)]. Leaving the position before the comma empty means all rows are selected.
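Since m-ibm6708.txt is not available here, the sketch below uses R's built-in mtcars data frame (32 rows, 11 columns) as a stand-in for da, and row 20 instead of row 120 because mtcars is shorter. It shows the difference between indexing with and without the comma:

```r
# Stand-in for da: the built-in mtcars data frame
first_six <- head(mtcars, 6)     # first six rows via head()
same_rows <- mtcars[1:6, ]       # first six rows via [rows, columns]

# Without the comma, a data frame is indexed by column:
# this returns columns 1 to 6 with all 32 rows
first_cols <- mtcars[1:6]

# A range plus individual rows
picked <- mtcars[c(2:6, 9, 20), ]

print(first_six)
```

With a three-column table like da, the comma-less da[1:6] asks for columns that do not exist, which would explain why it failed.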
