Quite new to R...
I loaded a file with 13458 observations containing a time and a value. I ran it through a program which detects homologue series. The output is a large list with 6 elements, including values IDed by the row number in the original file.
I would like to export the original file with values detected by the program marked somehow so I can easily identify them in Excel. Hopefully that makes some sense.
My dataframe looks like this and I'm using the m.z and RT values:
m.z dummy RT
1 151.0092 255975.8 15.043
2 151.0092 110111.7 15.456
3 151.0092 108958.1 15.243
4 151.0093 3258343.0 14.620
5 151.0127 107255.9 6.336
My output contains a list of related series and looks like this:
[359] "3518,4779,5929,6975,8032,9051,9825"
[360] "5927,6977,8036,9052,9824,10507,11043"
I would like a data frame that lets me know whether a value has been identified, like this:
m.z dummy RT homologue
3518 459.2006 255975.8 15.043 TRUE
3519 459.2120 110111.7 15.456 FALSE
3520 459.2159 108958.1 15.243 FALSE
Thanks!
Here is an attempt.
Your MS data:
DF <- read.table(text="m.z dummy RT
1 151.0092 255975.8 15.043
2 151.0092 110111.7 15.456
3 151.0092 108958.1 15.243
4 151.0093 3258343.0 14.620
5 151.0127 107255.9 6.336", header = T)
The script output:
vec <- c("1,3,5", "3,5") # from your example this looks like a vector of strings, each containing numbers separated by commas
As I understand it, you would like to label rows in DF with TRUE/FALSE depending on whether they appear anywhere in vec?
DF$homologue <- ifelse(row.names(DF) %in% as.numeric(unlist(strsplit(unlist(vec), ","))), T, F)
Explanation:
unlist(vec) # in case vec is a list rather than a plain character vector
strsplit(unlist(vec), ",") # split each string at ",", returning a list
unlist(str... # flatten that list into a character vector
as.numeric(unlist(str... # convert to numeric
If any row names of DF appear in that vector, they are labeled TRUE, otherwise FALSE.
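To see what each step produces, you can run the pieces on the example vec one at a time (just an illustration of the intermediate results):
strsplit(unlist(vec), ",") # a list of two character vectors: "1" "3" "5" and "3" "5"
unlist(strsplit(unlist(vec), ",")) # "1" "3" "5" "3" "5"
as.numeric(unlist(strsplit(unlist(vec), ","))) # 1 3 5 3 5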
DF
m.z dummy RT homologue
1 151.0092 255975.8 15.043 TRUE
2 151.0092 110111.7 15.456 FALSE
3 151.0092 108958.1 15.243 TRUE
4 151.0093 3258343.0 14.620 FALSE
5 151.0127 107255.9 6.336 TRUE
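To get the annotated table back out for Excel, writing it to a CSV is probably the simplest route (the filename here is just an example):
write.csv(DF, "homologues_marked.csv", row.names = TRUE) # filename is only an example
Excel opens the CSV directly, and you can then filter or sort on the homologue column.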
I have raw data with multiple observations, and I have a cleaning log containing new values for specific columns of the raw data. I want to replace the old values with these new ones.
My raw data is:
raw_df<- data.frame(
id=c(1,2,3,4),
name=c("a","b","c","d"),
age=c(15,16,20,22),
add=c("xyz","bc","no","da")
)
My cleaning log is:
cleaning_log <- data.frame(
id=c(2,4),
question=c("name","age"),
old_value=c("b",22),
new_value=c("bob",25)
)
And my expected result is:
result<-data.frame(
id=c(1,2,3,4),
name=c("a","bob","c","d"),
age=c(15,16,20,25),
add=c("xyz","bc","no","da")
)
Note: at the end, how can I check whether the new values have been replaced properly?
In addition, the question column of the cleaning log may refer to many more columns (perhaps 10 to 20) with new values; here I just give two column names as an example.
Thanks in advance for your help.
Find the row and column numbers to change in raw_df using match, then replace those cells with cleaning_log$new_value.
row_inds <- match(cleaning_log$id, raw_df$id)
col_inds <- match(cleaning_log$question, names(raw_df))
raw_df[cbind(row_inds, col_inds)] <- cleaning_log$new_value
raw_df
# id name age add
#1 1 a 15 xyz
#2 2 bob 16 bc
#3 3 c 20 no
#4 4 d 25 da
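To answer the note about verifying the replacements: since the same row/column index matrix can be reused, one quick check (a sketch; the comparison happens as character because of the mixed types) is
# should be TRUE for every entry of the cleaning log
all(raw_df[cbind(row_inds, col_inds)] == cleaning_log$new_value)
#[1] TRUE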
I want to map the FactorName in the dataframe FName to the column header names of Stack, i.e. Factor1 in Stack is actually named Value, Factor2 is Leverage, etc. I have a large dataset, so manually renaming is not an option.
Stack <- data.frame(rowid=1:3, Factor1=2:4, Factor2=3:5, Factor3=4:6)
FName <- data.frame(FactorID=c("Factor1","Factor2","Factor3"), FactorName=c("Value","Leverage","Growth"))
Thanks.
How about this using match:
Stack <- data.frame(rowid=1:3, Factor1=2:4, Factor2=3:5, Factor3=4:6)
FName <- data.frame(
FactorID=c("Factor1","Factor2","Factor3"),
FactorName=c("Value","Leverage","Growth"))
# Matching entries from FName
colnames(Stack) <- ifelse(
!is.na(FName$FactorName[match(colnames(Stack), FName$FactorID)]),
as.character(FName$FactorName[match(colnames(Stack), FName$FactorID)]),
colnames(Stack));
Stack;
# rowid Value Leverage Growth
#1 1 2 3 4
#2 2 3 4 5
#3 3 4 5 6
Explanation: We match column names of Stack and entries from FName$FactorID. If there is a match, replace with FName$FactorName, else keep the original column name.
If the factor names are already available in the right order, they can be assigned directly as the column names, e.g.
colnames(Stack) <- c("rowid", as.character(FName$FactorName))
Another approach using match, but using indexing instead of ifelse
# Get indices of matches
m <- match(names(Stack), FName$FactorID)
# replace names where a match is found.
names(Stack)[!is.na(m)] <- as.character(FName$FactorName[m[!is.na(m)]])
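For the example data this gives the same renamed columns as the ifelse version above:
names(Stack)
#[1] "rowid"    "Value"    "Leverage" "Growth"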
I need to search through a text string for keywords and then assign a category in an R dataframe. This creates a problem where I have keywords from more than one category. I would like to easily extract rows where more than one category is represented so that I can manually evaluate them and assign the correct category.
To do this, I have tried to add a count column to show how many categories are represented in each string.
Using a combination of the two solutions linked below, I have managed to get part of the way, but I am still not getting the correct output.
Partial animal string matching in R
Count occurrences of specific words from a dataframe row in R
I have created an example below. I would like the following rules to be applied:
if string has cat or lion wcount gets 1 - only 1 group represented (feline)
if string has dog or wolf wcount gets 1 - only 1 group represented (canine)
if string has (cat or lion) AND (dog or wolf) wcount get 2 - two groups represented (feline and canine)
I can then easily pull out rows where wcount > 1
id <- c(1:5)
text <- c('saw a cat',
'found a dog',
'saw a cat by a dog',
'There was a lion',
'Huge wolf'
)
dataset <- data.frame(id,text)
SearchGrp<-list(c("(cat|lion)", "feline"),
c("(dog|wolf)","canine"))
output_vector<- character (nrow(dataset))
for (i in seq_along(SearchGrp)){
output_vector[grepl(x=dataset$text, pattern = SearchGrp[[i]][1],ignore.case = TRUE)]<-SearchGrp[[i]][2]}
dataset$type<-output_vector
keyword_temp <- unlist(lapply(SearchGrp, function(x) new<-{x[1]}))
keyword<-paste(keyword_temp[1],"|",keyword_temp[2])
library(stringr)
getCount <- function(data,keyword)
{
wcount <- str_count(dataset$text, keyword)
return(data.frame(data,wcount))
}
getCount(dataset,keyword)
Here is a base R method to get the count across types.
dataset$wcnt <- rowSums(sapply(c("dog|wolf", "cat|lion"),
function(x) grepl(x, dataset$text)))
Here, sapply runs through the regular expressions for each type and feeds each one to grepl. This returns a matrix in which the columns are logical vectors indicating whether a particular type (e.g., "dog|wolf") was found. rowSums then sums the logicals along each row to get the type variety count.
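To see the intermediate matrix that rowSums operates on, you can run the sapply call on its own:
sapply(c("dog|wolf", "cat|lion"), function(x) grepl(x, dataset$text))
#      dog|wolf cat|lion
#[1,]    FALSE     TRUE
#[2,]     TRUE    FALSE
#[3,]     TRUE     TRUE
#[4,]    FALSE     TRUE
#[5,]     TRUE    FALSE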
Applied to the example data, the full expression returns
dataset
id text wcnt
1 1 saw a cat 1
2 2 found a dog 1
3 3 saw a cat by a dog 2
4 4 There was a lion 1
5 5 Huge wolf 1
If you want the intermediate step, returning the logical vectors as variables in your data.frame, you can set the patterns up in a named vector and then cbind the result.
# construct named vector
myTypes <- c("canine"="dog|wolf", "feline"="cat|lion")
# cbind sapply results of logicals to original data.frame
dataset <- cbind(dataset, sapply(myTypes, function(x) grepl(x, dataset$text)))
This returns
dataset
id text canine feline
1 1 saw a cat FALSE TRUE
2 2 found a dog TRUE FALSE
3 3 saw a cat by a dog TRUE TRUE
4 4 There was a lion FALSE TRUE
5 5 Huge wolf TRUE FALSE
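From those logical columns, the variety count is then just the row sums over them (same idea as above, now using the new columns):
dataset$wcnt <- rowSums(dataset[c("canine", "feline")])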
I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
quarter oncourt-id time-minutes time-seconds id
1 1 NA 12 0 1
2 2 NA 10 NA 1
There are a few problems I am dealing with:
The character array "test" has backslashes where there should be nothing, but I was having difficulty removing them with gsub in this format: gsub("\", "", test).
Not every element in test has the same number of entries; note in the example that the 2nd element doesn't have time-seconds, so for the data frame I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between the different column entries, but that returns a list of lists that is just as difficult to deal with.
You've got XML there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
df <- as.data.frame.list(XML::xmlToList(x))
is.na(df) <- df == ""
df
})
data.table::rbindlist(dflist, fill = TRUE)
# quarter oncourt.id time.minutes time.seconds id
# 1: 1 NA 12 0 1
# 2: 2 NA 10 NA 1
Note: You will need the XML and data.table packages for this solution.
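One small caveat: XML::xmlToList returns the attribute values as strings, so the columns above come back as character (or factor, depending on your R version). If you need numeric columns, one option (a sketch) is to add a type.convert step inside the lapply, before returning df:
df[] <- lapply(df, type.convert, as.is = TRUE) # convert columns to numeric where possible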
I have a function called notes_count(id) that takes a vector as a parameter (for example the function can accept different arguments 5, c(1,2,3), 6:20, or 5:1 to name a few) and returns the ID and "count" of the notes. I have a data frame with the following contents:
"ID" "Date" "Notes"
that contains an unknown number of entries per "ID", for example:
ID Date Notes
1 xxx "This is a note"
1 xxx "More notes here"
...
8 xxx "Hello World"
The problem I am running into is that I want the output to be ordered in the same way as the input vector meaning notes_count(3:1) should list the results in reverse order as a data frame:
ID notes_count
1 3 6
2 2 288
3 1 102
and calling notes_count(1:3) would result in:
ID notes_count
1 1 102
2 2 288
3 3 6
However, table always reorders from min to max regardless of the order it is given. Is there a way to do what table does directly on the data frame, using other functions, so that I can control the output?
Currently my code is this:
#Before calling table I have data frame "notes" in the order I want but table reorders it
notes_count <- as.data.frame(table(notes[["ID"]]))
which seems silly to make the original data frame a table and then convert it back.
EDIT:
Here is my code as basic as it is as requested
notes_count <- function(id){
## notes.csv format
## "ID","Date","Notes"
## 1,"2016-01-01","Some notes"
#read the csv to a data frame
notes <- read.csv("notes.csv")
#remove all NA values
notes <- notes[complete.cases(notes), ]
#here is where you can order the data but it won't matter when aggregating the notes to a "count" using table on the next line
notes <- notes[id, ]
#convert the table back to a data frame
notes_count <- as.data.frame(table(notes[["ID"]]))
notes_count
}
Here's a simplified example that should get you going:
set.seed(1234)
notes <- data.frame(id=sample(2:10,size = 100, replace = TRUE), Note="Some note")
notes_count <- function(id) {
counts <- table(notes[notes$id %in% id,])
return(data.frame(count=counts[as.character(id),]))
}
notes_count(c(10,2,5))
# Results
count
10 8
2 12
5 2
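For what it's worth, the rows come back in the requested order because counts is indexed with as.character(id), so reversing the input simply reverses the rows (same seed and data as above):
notes_count(c(5,2,10))
# Results
count
5 2
2 12
10 8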
If I understand correctly, you want to sort the data frame by the notes_count variable?
Then use the order function and reshuffle the rows of the data frame:
your_data_frame[order(your_data_frame$notes_count,decreasing=TRUE),]