R: Finding the corresponding row value

I'm trying to get the values from column 1 for the rows where column 2 is "B", and collect those values in a list.
This needs to run over 50,000 rows, around 37,000 of which match.
I'm incredibly new to this, so any help would be appreciated.
Data <- data.frame(
X = sample(1:10),
Y = sample(c("B", "W"), 10, replace = TRUE)
)
Count <- 1
if (Data[Count, 2] == "B") {
  List <- list(Data[Count, 1])
  Count <- Count + 1
  # I'm not sure what to use to repeat; I just put "repeat" here
} else {
  Count <- Count + 1
  # repeat
}
The end result should be a list() containing only column 1 values.
For example, if rows 1-5 had "B", I want the column 1 numbers from those rows.

Not sure if I understood correctly what you're looking for, but from the comments I would assume that this might help:
setNames(data.frame(Data[1][Data[2]=="B"]), "selected")
# selected
#1 2
#2 5
#3 7
#4 6
No loop needed.
data
Data <- structure(list(X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
Y = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L),
.Label = c("B", "W"), class = "factor")),
.Names = c("X", "Y"), row.names = c(NA, -10L),
class = "data.frame")
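Since the question asks for a list() rather than a data frame, the same logical subsetting can be wrapped in as.list(). A minimal sketch, using the example Data from this answer rebuilt with plain vectors (the name b_values is only illustrative):

```r
# Example data from the answer above
Data <- data.frame(
  X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
  Y = c("W", "W", "W", "W", "W", "B", "B", "W", "B", "B")
)

# Keep the X values of the rows where Y is "B", one list element per value
b_values <- as.list(Data$X[Data$Y == "B"])
str(b_values)
# List of 4
#  $ : int 2
#  $ : int 5
#  $ : int 7
#  $ : int 6
```

The comparison Data$Y == "B" is vectorized, so this should scale to 50,000 rows without an explicit loop.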

Related

Converting a data frame to a modified list

Although there are a lot of questions concerning this topic, I cannot seem to find one with the right answer, so I am asking here.
The context:
I've got a data set with a lot of rows (150K+) and 32 columns. The second column is a document number. The document number is not a unique ID, so the data contains multiple rows with the same document number. I would like to create a list of the document numbers, where each document number contains another list with the rows that share that document number.
For example:
Here is an example of the data (I included a dput output of the example below).
Document Number  Col.A         Col.B
A                random_56681  random_24984
A                random_78738  random_23098
A                random_48640  random_32375
B                random_96243  random_96927
B                random_72045  random_52583
C                random_19367  random_20441
C                random_96778  random_22161
C                random_48038  random_95644
C                random_62999  random_44561
Now here is what I am looking for. I need a list that contains the 3 documents (A, B, C). Each of these lists needs to contain another list holding the corresponding rows. For example, the main list (let's say my_list) should have 3 lists A, B and C, and these should contain 3, 2 and 4 lists respectively.
Hope I was clear enough in asking the question (if not please let me know).
Here you can find the example data:
structure(list(Document_Number = structure(c(1L, 1L, 1L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
Col.A = structure(c(4L, 7L, 3L, 8L, 6L, 1L, 9L, 2L, 5L), .Label = c("random_19367",
"random_48038", "random_48640", "random_56681", "random_62999",
"random_72045", "random_78738", "random_96243", "random_96778"
), class = "factor"), Col.B = structure(c(4L, 3L, 5L, 9L,
7L, 1L, 2L, 8L, 6L), .Label = c("random_20441", "random_22161",
"random_23098", "random_24984", "random_32375", "random_44561",
"random_52583", "random_95644", "random_96927"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
You can use split (x here is the data frame from the question):
split(x, x$Document_Number)
#$A
# Document_Number Col.A Col.B
#1 A random_56681 random_24984
#2 A random_78738 random_23098
#3 A random_48640 random_32375
#
#$B
# Document_Number Col.A Col.B
#4 B random_96243 random_96927
#5 B random_72045 random_52583
#
#$C
# Document_Number Col.A Col.B
#6 C random_19367 random_20441
#7 C random_96778 random_22161
#8 C random_48038 random_95644
#9 C random_62999 random_44561
Another option is group_split from dplyr (df1 here is the same data frame):
library(dplyr)
df1 %>%
group_split(Document_Number)
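Both answers return one data frame per document. If each document should itself be a list with one element per row, as the question describes, split can be applied a second time. A minimal sketch, assuming the example data is stored in x (the name nested is only illustrative):

```r
# Example data from the question, rebuilt with plain vectors
x <- data.frame(
  Document_Number = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  Col.A = c("random_56681", "random_78738", "random_48640", "random_96243",
            "random_72045", "random_19367", "random_96778", "random_48038",
            "random_62999"),
  Col.B = c("random_24984", "random_23098", "random_32375", "random_96927",
            "random_52583", "random_20441", "random_22161", "random_95644",
            "random_44561")
)

# Outer split: one data frame per document number.
# Inner split: one single-row data frame per row of that document.
nested <- lapply(split(x, x$Document_Number),
                 function(d) split(d, seq_len(nrow(d))))

lengths(nested)
# A B C
# 3 2 4
```

Here nested$A[[1]] is the first row belonging to document A, still as a one-row data frame.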

Overwrite levels of factor columns in one dataframe using another

I have 2 data frames with multiple factor columns. One is the base data frame and the other is the final data frame. I want to update the levels of the base data frame using the final data frame.
Consider this example:
base <- data.frame(product=c("Business Call", "Business Transactional",
"Monthly Non-Compounding and Standard Non-Compounding",
"OCR based Call", "Offsale Call", "Offsale Savings",
"Offsale Transactional", "Out of Scope","Personal Call"))
base$product <- as.factor(base$product)
final <- data.frame(product=c("Business Call", "Business Transactional",
"Monthly Standard Non-Compounding", "OCR based Call",
"Offsale Call", "Offsale Savings","Offsale Transactional",
"Out of Scope","Personal Call", "You Money"))
final$product <- as.factor(final$product)
What I want is for the final data frame to have the same levels as base, with levels that have no counterpart at all, like "You Money", removed, whereas "Monthly Standard Non-Compounding" should be fuzzy matched.
E.g.:
levels(base$var1)   # "a" "b" "c"
levels(final$var1)  # "Aa" "Bb" "Cc"
Is there a way to overwrite the levels in the base data using the final data with some kind of fuzzy match?
That is, I want the final levels to be the same for both data frames:
levels(base$var1)   # "Aa" "Bb" "Cc"
levels(final$var1)  # "Aa" "Bb" "Cc"
We could build our own fuzzyMatcher.
First, we need a vectorized version of agrep,
agrepv <- function(x, y) all(as.logical(sapply(x, agrep, y)))
on which we build our fuzzyMatcher.
fuzzyMatcher <- function(from, to) {
  # for each column of `from`, find the column of `to` whose levels fuzzy-match
  mc <- mapply(function(y)
    which(mapply(function(x) agrepv(y, x), Map(levels, to))),
    Map(levels, from))
  # overwrite the levels of each `to` column with the matched `from` levels
  return(Map(function(x, y) `levels<-`(x, y), to,
             Map(levels, from)[mc]))
}
The final labels applied to the base labels (note that I've swapped the columns to make it a little more sophisticated):
base1[] <- fuzzyMatcher(final1, base1)
base1
# X1 X2
# 1 Aa Xx
# 2 Aa Xx
# 3 Aa Yy
# 4 Aa Yy
# 5 Bb Yy
# 6 Bb Zz
# 7 Bb Zz
# 8 Aa Xx
# 9 Cc Xx
# 10 Cc Zz
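As a point of comparison, and not the method used in this answer, the same relabelling idea can be sketched with adist() from utils, which returns edit distances instead of match indices. The toy factor f and the vector target below are illustrative assumptions modelled on the a/b/c versus Aa/Bb/Cc example:

```r
# Toy factor and candidate labels (illustrative only)
f <- factor(c("a", "a", "b", "c", "b"))
target <- c("Cc", "Aa", "Bb")

# Edit-distance matrix: rows are the current levels, columns the candidates
d <- adist(levels(f), target)

# For each current level, pick the candidate with the smallest distance
closest <- target[apply(d, 1, which.min)]

levels(f) <- closest
levels(f)
# [1] "Aa" "Bb" "Cc"
```

Note that which.min always returns some candidate and takes the first one on ties, so a level with no real counterpart (such as "You Money") would still be assigned a label unless a distance threshold is added.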
Update
Based on the newly provided data above, it makes sense to use another vectorized function, agrepv2(), which, used with outer(), lets us apply agrep to all combinations of the levels of both vectors. Columns whose colSums equal zero then give us the non-matching levels, and which.max gives the matching levels of the target data frame final. We use these two vectors to delete the unmatched rows of final and to select the desired levels of the base data frame in order to rebuild the factor column.
# add to mimic other columns in data frame
base$x <- seq(nrow(base))
final$x <- seq(nrow(final))
# some abbreviations for convenience
p1 <- levels(base$product)
p2 <- levels(final$product)
# agrep
agrepv2 <- Vectorize(function(x, y, ...) agrep(p2[x], p1[y])) # new vectorized agrep
out <- t(outer(seq(p2), seq(p1), agrepv2, max.distance=0.9)) # apply `agrepv2`
del.col <- grep(0, colSums(apply(out, 2, lengths))) # find negative matches
lvl <- unlist(apply(out, 2, which.max)) # find positive matches
lvl <- as.character(p2[lvl]) # get the labels
# delete "non-existing" rows and re-generate factor with new labels
transform(final[-del.col, ], product=factor(product, labels=lvl))
# product x
# 1 Business Call 1
# 2 Business Transactional 2
# 4 OCR based Call 4
# 5 Offsale Call 5
# 6 Offsale Savings 6
# 7 Offsale Transactional 7
# 8 Out of Scope 8
# 9 Personal Call 9
Data
base1 <- structure(list(X1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), X2 = structure(c(1L,
1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 3L), .Label = c("x", "y", "z"
), class = "factor")), row.names = c(NA, -10L), class = "data.frame")
final1 <- structure(list(X1 = structure(c(1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L,
2L, 2L, 3L, 3L, 2L, 2L, 2L), .Label = c("Xx", "Yy", "Zz"), class = "factor"),
X2 = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L), .Label = c("Aa", "Bb", "Cc"), class = "factor")), row.names = c(NA,
-15L), class = "data.frame")

Return one out of multiple rows with partially matching entries

I have a dataset named Proteome. It has 14 columns and thousands of rows.
dput(Proteome)
structure(list(Protein.name = structure(c(1L, 1L, 1L, 1L, 2L,
3L), .Label = c("HCTF", "IFT", "ROSF"), class = "factor"), X..Proteins = c(5L,
5L, 5L, 5L, 3L, 7L), X..PSMs = c(3L, 1L, 6L, 2L, 2L, 4L), Previous.5.amino.acids = structure(c(4L,
5L, 4L, 2L, 3L, 1L), .Label = c("CWYAT", "FCLKP", "MGCPT", "NCTMY",
"TMYFC"), class = "factor"), Sequence = structure(c(5L, 1L, 4L,
2L, 3L, 6L), .Label = c("FCLKPGCNFHAESTRGYR", "GCNFHAESTR", "GFGFNWPHAVR",
"GHFCLKPGCNFHAESTR", "GHFCLKPGCNFHAESTRGYR", "GNFSVKLMNR"), class = "factor")), .Names = c("Protein.name",
"X..Proteins", "X..PSMs", "Previous.5.amino.acids", "Sequence"
), class = "data.frame", row.names = c(NA, -6L))
The column of interest in this dataset is "Sequence". In row 2 of this column, first two letters of row 1 are missing; in row 3, last three letters of row 1 are missing; in row 4, first seven and last three letters of row 1 are missing.
Rows 2, 3, and 4 reflect the artifacts of the scientific method I have been using to generate the data, and therefore I want to remove these entries.
I want R to return only one of the four rows, ideally row 1, and remove the rest. One way to do this would be to find all rows that share a matching string of letters and then drop all but one of them. For example, in the data set above, GCNFHAESTR occurs in all four rows, so I want R to return only one of them, ideally the top one. But I don't know how to do this.
To clarify further, "Sequence" has hundreds of rows with partially matching entries, but the matching strings in those rows differ from the one shown in the example above. For example, it is possible that rows no. 35 and 39 have the following entries (Row 35: GNYTCAGCWPFK, and Row 36: YTCAGCWPFK). As the matching strings in these rows are completely different from the ones in the example above, I cannot declare the string beforehand. So I want a mechanism that detects all rows with partially matching entries and then keeps only one of them while deleting the others.
I look forward to hearing from the experts.
Thanks!
If I understood correctly, you just need to subset your data according to the presence of the string you want. Use grepl for that.
aa <- structure(list(Protein.name = structure(c(1L, 1L, 1L, 1L, 2L, 3L),
.Label = c("HCTF", "IFT", "ROSF"),
class = "factor"),
X..Proteins = c(5L, 5L, 5L, 5L, 3L, 7L),
X..PSMs = c(3L, 1L, 6L, 2L, 2L, 4L),
Previous.5.amino.acids = structure(c(4L, 5L, 4L, 2L, 3L, 1L),
.Label = c("CWYAT", "FCLKP", "MGCPT", "NCTMY", "TMYFC"),
class = "factor"),
Sequence = structure(c(5L, 1L, 4L, 2L, 3L, 6L),
.Label = c("FCLKPGCNFHAESTRGYR", "GCNFHAESTR", "GFGFNWPHAVR",
"GHFCLKPGCNFHAESTR", "GHFCLKPGCNFHAESTRGYR", "GNFSVKLMNR"),
class = "factor")),
.Names = c("Protein.name", "X..Proteins", "X..PSMs", "Previous.5.amino.acids", "Sequence"),
class = "data.frame", row.names = c(NA, -6L))
It helps to declare the string beforehand:
myStrToDetect <-'GCNFHAESTR'
#the following line filters the data set into those where "Sequence" has the pattern you provided (4 rows)
matching_df <- aa[grepl(myStrToDetect , aa$Sequence),]
Protein.name X..Proteins X..PSMs Previous.5.amino.acids Sequence
1 HCTF 5 3 NCTMY GHFCLKPGCNFHAESTRGYR
2 HCTF 5 1 TMYFC FCLKPGCNFHAESTRGYR
3 HCTF 5 6 NCTMY GHFCLKPGCNFHAESTR
4 HCTF 5 2 FCLKP GCNFHAESTR
# This next command chooses only the first line, if there are multiple occurrences
head(matching_df, 1)
Protein.name X..Proteins X..PSMs Previous.5.amino.acids Sequence
1 HCTF 5 3 NCTMY GHFCLKPGCNFHAESTRGYR
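The question also notes that the shared string is not known in advance. One way around that, offered here as a sketch rather than as part of the answer above, is to drop every row whose Sequence occurs inside the Sequence of some other row. The helper name is_fragment is an illustrative assumption:

```r
# Sequence column from the example data
seqs <- c("GHFCLKPGCNFHAESTRGYR", "FCLKPGCNFHAESTRGYR", "GHFCLKPGCNFHAESTR",
          "GCNFHAESTR", "GFGFNWPHAVR", "GNFSVKLMNR")

# TRUE when a sequence is a literal substring of any other sequence
is_fragment <- vapply(
  seq_along(seqs),
  function(i) any(grepl(seqs[i], seqs[-i], fixed = TRUE)),
  logical(1)
)

seqs[!is_fragment]
# [1] "GHFCLKPGCNFHAESTRGYR" "GFGFNWPHAVR"          "GNFSVKLMNR"
```

With the data frame from the answer this would be used as aa[!is_fragment, ]. Two caveats: exact duplicates count as fragments of each other and would both be dropped, and the pairwise comparison grows quadratically with the number of rows, so it may need to be run per protein on large data.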

Find pairs of rows with identical values in different columns

I'm trying to subset some data but got stuck at this part. My data looks like this:
structure(list(sym_id = structure(c(1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 4L, 5L, 5L), .Label = c("AOL.HH", "ARCH.GA", "ARCH.GK",
"T.GJ", "T.GK"), class = "factor"), comp = structure(c(1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("AOL", "ARCH",
"T"), class = "factor"), seq_nb = c(18327L, 9952L, 39808L,
56601L, 44974L, 55302L, 20023L, 24403L, 15529L, 46202L, 57269L
), orig_seq_nb = c(81261L, 72161L, 9952L,
1276L, 98216L, 16423L, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_)), .Names = c("bond_sym_id",
"company_symbol", "seq_nb", "orig_seq_nb"), row.names = c(NA,
-11L), class = c("tbl_df", "tbl", "data.frame"))
I'm looking for code that returns rows whose value in one column matches the value in a different column of another row, and which also share identical values in the other columns.
The output should give me back
Row1 ARCH.GA ARCH 9952 72161
Row2 ARCH.GA ARCH 39808 9952
As you can see, the columns "sym_ID" and "comp" are equal for my desired output and the values in "seq_nb" and "orig_seq_nb" match.
Appreciate your help!
We take the 3rd and 4th columns, loop through the rows, sort each row and keep the 1st (smallest) element, cbind the result with the first two columns, and use duplicated to get a logical index of the duplicated rows, which is then used to subset the rows of 'df1'.
d2 <- cbind(df1[1:2], apply(df1[3:4],1, function(x) x[order(x)][1]))
df1[duplicated(d2)|duplicated(d2, fromLast=TRUE),]
# bond_sym_id company_symbol seq_nb orig_seq_nb
# <fctr> <fctr> <int> <int>
#1 ARCH.GA ARCH 9952 72161
#2 ARCH.GA ARCH 39808 9952
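A more literal reading of the requirement is to match seq_nb in one row against orig_seq_nb in another row of the same bond and company. The following is a sketch of that alternative, not the answer's method. It assumes the data is stored in df1, and the key names are illustrative:

```r
# Example data from the question, rebuilt with plain vectors
df1 <- data.frame(
  bond_sym_id = c("AOL.HH", rep("ARCH.GA", 5), "ARCH.GK", "ARCH.GK",
                  "T.GJ", "T.GK", "T.GK"),
  company_symbol = c("AOL", rep("ARCH", 7), rep("T", 3)),
  seq_nb = c(18327L, 9952L, 39808L, 56601L, 44974L, 55302L,
             20023L, 24403L, 15529L, 46202L, 57269L),
  orig_seq_nb = c(81261L, 72161L, 9952L, 1276L, 98216L, 16423L,
                  NA, NA, NA, NA, NA)
)

# One key per row for each of the two number columns
key_seq  <- paste(df1$bond_sym_id, df1$company_symbol, df1$seq_nb)
key_orig <- paste(df1$bond_sym_id, df1$company_symbol, df1$orig_seq_nb)

# Keep rows whose seq_nb is some row's orig_seq_nb, or vice versa
hit <- key_seq %in% key_orig | key_orig %in% key_seq
df1[hit, ]
#   bond_sym_id company_symbol seq_nb orig_seq_nb
# 2     ARCH.GA           ARCH   9952       72161
# 3     ARCH.GA           ARCH  39808        9952
```

Missing orig_seq_nb values end up as keys ending in "NA", which never match here because seq_nb has no missing values. If seq_nb could be NA as well, those rows would need to be excluded explicitly.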

R program, ?count, rename "freq" to something else

I am studying this webpage and cannot figure out how to rename freq to something else, say "number of times imbibed".
Here is dput
structure(list(name = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("Bill", "Llib"), class = "factor"), drink = structure(c(2L,
3L, 1L, 4L, 2L, 3L, 1L, 4L), .Label = c("cocoa", "coffee", "tea",
"water"), class = "factor"), cost = 1:8), .Names = c("name",
"drink", "cost"), row.names = c(NA, -8L), class = "data.frame")
And this is working code with output. Again, I'd like to rename the freq column. Thanks!
library(plyr)
bevs$cost <- as.integer(bevs$cost)
count(bevs, "name")
Output
name freq
1 Bill 4
2 Llib 4
Are you trying to do this?
counts <- count(bevs, "name")
names(counts) <- c("name", "number of times imbibed")
counts
The count() function returns a data.frame. Just rename it like any other data.frame:
counts <- count(bevs, "name")
names(counts)[which(names(counts) == "freq")] <- "number of times imbibed"
print(counts)
# name number of times imbibed
# 1 Bill 4
# 2 Llib 4
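If plyr is not otherwise needed, a similar count with a custom column name can be produced in base R. This is a sketch of an alternative, not what plyr::count does internally. It assumes the data is stored in bevs, and the column name times_imbibed is illustrative (a name without spaces, so that data.frame() does not alter it):

```r
# Example data from the question, rebuilt with plain vectors
bevs <- data.frame(
  name  = c("Bill", "Llib", "Bill", "Llib", "Bill", "Llib", "Bill", "Llib"),
  drink = c("coffee", "tea", "cocoa", "water", "coffee", "tea", "cocoa", "water"),
  cost  = 1:8
)

# table() counts the rows per name; responseName sets the count column's name
counts <- as.data.frame(table(name = bevs$name), responseName = "times_imbibed")
counts
#   name times_imbibed
# 1 Bill             4
# 2 Llib             4
```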

Resources