I want to match an original ID with a new ID which is only a fragment of the original ID and return all of the original IDs. Ex. For a data.frame dat, OrigID is a column name. ID value is XXX_X_XXX and the new ID is only the last portion after the underscore sign _, which is XXX. How can I match this?
I'm not sure how to return only the fragment. I think the code below returns all hits and not just the portion after the '_', giving me too many values. I also want to place NA values in the vector wherever the IDs don't match.
Ex.
IDdat <- read.csv("OrigID.csv")
data <- read.csv("data.csv")
subjects <- unique(data$ID)
IDlist <- c()
for (i in 1:length(subjects)) {
OrigID <- grep(subjects[i], IDdat$ID, value = TRUE)
IDlist <- rbind(IDlist, data.frame(OrigID))
}
Thanks!
We can use grep
grep(new_ID, colnames(dat))
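If the goal is to compare only the part after the last underscore and get NA where nothing matches, something along these lines might work. It is only a sketch: the vectors orig_ids and new_ids are made-up stand-ins for IDdat$ID and unique(data$ID), and it assumes each fragment matches at most one original ID.

```r
# made-up example data standing in for IDdat$ID and unique(data$ID)
orig_ids <- c("A12_B_001", "A12_C_002", "Z99_Q_003")
new_ids  <- c("002", "999", "001")

# strip everything up to and including the last underscore
suffix <- sub(".*_", "", orig_ids)

# match() returns NA for fragments that have no counterpart,
# so indexing with it leaves NA in those positions
matched <- orig_ids[match(new_ids, suffix)]
matched
```

Because match() compares whole strings, "001" cannot accidentally hit an ID that merely contains "001" somewhere in the middle, which is what happens with a plain grep.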
Related
I am trying to remove a row in a dataframe based on string matching. I'm using:
data <- data[- grep("my_string", data$field1),]
When there's an actual row with the value "my_string" in data$field1 this works as expected and it drops that row. However, if there is no string "my_string", it creates an empty dataframe. How do I write this so that it allows for the possibility of the string not existing, and still keeps my data frame intact?
It may be better to use grepl and negate with !
data[!grepl("my_string", data$field1),]
Or another option is setdiff on grep
data[setdiff(seq_len(nrow(data)), grep("my_string", data$field1)),]
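To see why the original version empties the data frame: when nothing matches, grep returns integer(0), and negating that is still integer(0), which selects zero rows. A small sketch with a made-up data frame (df and its columns are just for illustration):

```r
# made-up data with no "my_string" in it
df <- data.frame(field1 = c("a", "b", "c"), n = 1:3,
                 stringsAsFactors = FALSE)

hits <- grep("my_string", df$field1)   # integer(0): nothing matched

# -integer(0) is still integer(0), so this selects no rows at all
broken <- df[-hits, ]

# a logical mask is the same length as the data, so it keeps every row
kept <- df[!grepl("my_string", df$field1), ]

nrow(broken)
nrow(kept)
```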
You can use a plain if statement.
df <- data.frame(field = c("my_string", "my_string_not", "something", "something_else"),
                 numbers = 1:4)
# no match: result is empty, so df is left untouched
result <- grep("gabriel", df$field)
if (length(result))
{
  df <- df[- result, ]
}
df
# match: the matching rows are dropped
result <- grep("my_string", df$field)
if (length(result))
{
  df <- df[- result, ]
}
df
I have a dataframe with an ID column and another dummy column. In the first step the user enters a number which should be one of the IDs (ID_edit). Then the respective row index is determined. If the ID is in the dataframe everything works fine. If not (because the user enters a wrong ID or no ID at all) there should be an error message. I tried this:
test_df <- data.frame("ID" = c(1,3,6,8),
"char" = c("a","b","c","d"))
ID_edit <- as.integer(2)
row_nr_df <- which(test_df$ID == ID_edit, arr.ind=TRUE)
View(test_df$ID)
row_list <- as.numeric(rownames(test_df))
if(!is.null(row_nr_df %in% row_list)) {
print("Row number in row list")
} else {
print("Row number not in row list")}
View(row_nr_df)
If I change
ID_edit <- as.integer(1)
which is working, to
ID_edit <- as.integer(2)
the if-statement is still TRUE, but I expect and want to have the else block here.
View(row_nr_df)
shows then the message "No data available in table".
In the end I want to access the dataframe with the row number, e.g.:
char_edit <- test_df$char[[row_nr_df]]
But this does not work if the row number does not exist.
test_df <- data.frame("ID" = c(1,3,6,8),
"char" = c("a","b","c","d"))
isin <- function(x, data) {
  if (length(which(data$ID == x, arr.ind=TRUE)) >= 1) {
    data[which(data$ID == x, arr.ind=TRUE), ]
  } else {
    "not in list"
  }
}
> isin(x=3,data=test_df)
ID char
2 3 b
> isin(x=2,data=test_df)
[1] "not in list"
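As for why the if-statement in the question is always TRUE: when the ID is missing, which() returns integer(0), and integer(0) %in% row_list is logical(0), which is not NULL. Checking the length instead avoids that. The sketch below shows this, along with a hypothetical helper (lookup_char is not from the question) that raises an error for a missing ID.

```r
test_df <- data.frame(ID = c(1, 3, 6, 8),
                      char = c("a", "b", "c", "d"),
                      stringsAsFactors = FALSE)

# integer(0) %in% x gives logical(0), which is not NULL,
# so !is.null(...) is TRUE even though nothing matched
row_nr <- which(test_df$ID == 2)
always_true <- !is.null(row_nr %in% seq_len(nrow(test_df)))

# hypothetical helper: fail loudly when the ID is missing
lookup_char <- function(id, data) {
  row_nr <- which(data$ID == id)
  if (length(row_nr) == 0) {
    stop("ID not found: ", id)
  }
  data$char[[row_nr[1]]]   # first match, in case IDs repeat
}

found <- lookup_char(3, test_df)

# catch the error so the script keeps running
missing_msg <- tryCatch(lookup_char(2, test_df),
                        error = function(e) conditionMessage(e))
```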
I have the following .csv file:
https://drive.google.com/open?id=0Bydt25g6hdY-RDJ4WG41VFpyX1k
And I would like to be able to take the date and agent name(pasting its constituent parts) and append them as columns to the right of the table, up until it finds a different name and date, doing the same for the remaining name and date items, to get the following result:
The only thing I have been able to do with the dplyr package is the following:
library(dplyr)
library(stringr)
report <- read.csv(file ="test15.csv", head=TRUE, sep=",")
date_pattern <- "(\\d+/\\d+/\\d+)"
date <- str_extract(report[,2], date_pattern)
report <- mutate(report, date = date)
Which gives me the following result:
The difficulty I am having is probably in using conditionals to make the script pick up the appropriate string and append it as a column at the end of the table.
This might be crude, but I think it illustrates several things: a) setting stringsAsFactors=F; b) "pre-allocating" the columns in the data frame; and c) using the column name instead of column number to set the value.
report <- read.csv('test15.csv', header=T, stringsAsFactors=F)
# first, allocate the two additional columns (with NAs)
report$date <- rep(NA, nrow(report))
report$agent <- rep(NA, nrow(report))
# step through the rows
for (i in 1:nrow(report)) {
  # grab current name and date if "Agent:"
  if (report[i,1] == 'Agent:') {
    currDate <- report[i+1,2]
    currName <- paste(report[i,2:5], collapse=' ')
  # otherwise append the name/date
  } else {
    report[i,'date'] <- currDate
    report[i,'agent'] <- currName
  }
}
write.csv(report, 'test15a.csv')
I have a dataframe with column names mycolumns (it has more than 2000 columns). I have an object called myobject which contains a set of strings that partially match the column names in mycolumns (each string matches only one column name). I want to replace the column names with the respective strings in myobject, so the new column names of the dataframe will be "jackal", "cat.11", "Rat.Fox". Please note this has to be done using pattern matching or regex, as the order of the matched names could be different in myobject.
mycolumns <- c("jackal.fox11.FAD", "cat.11.miss.DAD", "Rat.Fox.11.33.DDG")
myobject <- c("jackal","Rat.Fox","cat.11")
How about a for loop with grep:
#your example
mycolumns <- c("jackal.fox11.FAD", "cat.11.miss.DAD", "Rat.Fox.11.33.DDG")
myobject <- c("jackal","Rat.Fox","cat.11")
#for loop solution
for(i in myobject){
mycolumns[grepl(i, mycolumns)] <- i
}
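One thing to watch for with this loop: the strings in myobject contain ".", which is a wildcard in a regular expression, so "cat.11" would also match something like "catX11". If a literal match is what you want, passing fixed = TRUE to grepl is one way to get it. A sketch using the example data, writing into a separate new_names vector (my own name) so the matching always runs against the untouched originals:

```r
mycolumns <- c("jackal.fox11.FAD", "cat.11.miss.DAD", "Rat.Fox.11.33.DDG")
myobject  <- c("jackal", "Rat.Fox", "cat.11")

new_names <- mycolumns
for (i in myobject) {
  # fixed = TRUE treats "." as a literal dot rather than "any character"
  new_names[grepl(i, mycolumns, fixed = TRUE)] <- i
}
new_names
```

This still assumes each string hits exactly one column, as stated in the question.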
Data setup:
> mycols = qw("jackal.fox11.FAD cat.11.miss.DAD Rat.Fox.11.33.DDG")
> df = read.csv(textConnection("1,2,3"), header=F)
> names(df) = qw("jackal Rat.Fox cat.11")
The business:
> names(df) = sapply(names(df), function(n) mycols[grepl(n, mycols)])
The result:
> names(df)
[1] "jackal.fox11.FAD" "Rat.Fox.11.33.DDG" "cat.11.miss.DAD"
Props to @luke-singham for the basis of this approach.
qw defined in my .Rprofile as in https://stackoverflow.com/a/31932661/338303
If you can guarantee that the names are the same as here, this is quite simple. However, that situation is trivial, so there doesn't seem to be any value in the solution vs just names(df) <- myobject
names(df)[c(grep(myobject[1], mycolumns), grep(myobject[2], mycolumns), grep(myobject[3], mycolumns))] <- myobject
I have the function below, which tracks the number of miles run by a person in a city and in a town on different days. I have 3 columns: Id (of a person), City and Town. For the same value of Id I have different values of miles run in a city and in a town, or NA if no miles were run. So I can have Id=1 in multiple rows with different values for city and town, and similarly for Id=2 and so on. I have 500 csv files, one for each Id, and now I need to calculate the mean for any combination of ids. Below is my function.
milesmean <- function(directory, place, id = 1:500){
if(directory == "miledata"){
files <- list.files()
data <- list()
for (i in 1:500){
data[[i]] = read.csv(files[[i]])
}
req.data <- vector("list", length = length(id))
for(j in id){
req.data[[j]] <- data[[j]]$place
}
mean(unlist(req.data), na.rm=TRUE)
}
}
But when I call milesmean("miledata","city",1:10) I get NA as value and warning message
Warning message:
In mean.default(unlist(req.data), na.rm = TRUE) :
argument is not numeric or logical: returning NA
Any reason why? TIA. Note: I need to solve this only by looping not using lapply and other similar functions
The line:
req.data[[j]] <- data[[j]]$place
is looking for a column literally called 'place' in your imported data.frame. If you wish to use the value supplied in the argument place you need to change it to:
req.data[[j]] <- data[[j]][[place]]
As there is no column called 'place', data[[j]]$place is NULL every time, so req.data ends up holding nothing but NULLs. These collapse into a single NULL when unlisted, which is what causes the warning from the mean function.
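The difference between the two forms is easy to check on a small made-up data frame (the values here are invented, only the column names follow the question):

```r
# made-up data; only the column names follow the question
df <- data.frame(city = c(1, NA, 3), town = c(4, 5, NA))
place <- "city"

by_dollar  <- df$place      # looks for a column literally named "place": NULL
by_bracket <- df[[place]]   # uses the value of the variable: the city column

# the mean of the city column, ignoring the NA
city_mean <- mean(by_bracket, na.rm = TRUE)
city_mean
```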
You can probably cut out the first loop too, leaving you with:
milesmean <- function(directory, place, id = 1:500){
  if(directory == "miledata"){
    files <- list.files()
    req.data <- vector("list", length = length(id))
    for(j in seq_along(id)){
      req.data[[j]] <- read.csv(files[[id[j]]])[[place]]
    }
    mean(unlist(req.data), na.rm=TRUE)
  }
}
to save reading files that you're not using for the mean.