Alternative to using table in R?

I have a function called notes_count(id) that takes a vector as a parameter (for example, it can accept 5, c(1,2,3), 6:20, or 5:1, to name a few) and returns the ID and "count" of the notes. I have a data frame with the columns
"ID" "Date" "Notes"
that contains an unknown number of entries per "ID", for example:
ID Date Notes
1 xxx "This is a note"
1 xxx "More notes here"
...
8 xxx "Hello World"
The problem I am running into is that I want the output to be ordered in the same way as the input vector, meaning notes_count(3:1) should list the results in reverse order as a data frame:
ID notes_count
1 3 6
2 2 288
3 1 102
and calling notes_count(1:3) would result in:
ID notes_count
1 1 102
2 2 288
3 3 6
However, table always reorders from min to max regardless of the order it was originally given. Is there a way to do what table does directly on the data frame, using other functions, so that I can control the output?
Currently my code is this:
#Before calling table I have data frame "notes" in the order I want but table reorders it
notes_count <- as.data.frame(table(notes[["ID"]]))
which seems silly: it turns the original data frame into a table and then converts it back.
EDIT:
Here is my code as basic as it is as requested
notes_count <- function(id){
  ## notes.csv format
  ## "ID","Date","Notes"
  ## 1,"2016-01-01","Some notes"
  # read the csv into a data frame
  notes <- read.csv("notes.csv")
  # remove all rows with NA values
  notes <- notes[complete.cases(notes), ]
  # here is where you can order the data, but it won't matter when
  # aggregating the notes to a "count" using table on the next line
  notes <- notes[id, ]
  # convert the table back to a data frame
  notes_count <- as.data.frame(table(notes[["ID"]]))
  notes_count
}

Here's a simplified example that should get you going:
set.seed(1234)
notes <- data.frame(id = sample(2:10, size = 100, replace = TRUE), Note = "Some note")
notes_count <- function(id) {
  counts <- table(notes[notes$id %in% id, ])
  return(data.frame(count = counts[as.character(id), ]))
}
notes_count(c(10,2,5))
# Results
count
10 8
2 12
5 2
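If you'd rather avoid table entirely, here is a minimal sketch that counts matches per ID with vapply (assuming a data frame notes with an ID column, as in the question); the results follow the order of the input vector:
notes_count <- function(id) {
  # one sum() per requested ID, in the caller's order
  data.frame(ID = id,
             notes_count = vapply(id, function(i) sum(notes$ID == i), integer(1)))
}
notes_count(3:1)   # rows come back as 3, 2, 1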

If I understand correctly, you want to sort the data frame by the notes_count variable? Then use the order function and reshuffle the data frame's rows:
your_data_frame[order(your_data_frame$notes_count,decreasing=TRUE),]
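If instead the goal is the original one of keeping table but restoring the input order, one hedged option is to reindex the table result with match (a sketch reusing the as.data.frame(table(...)) call from the question, with notes and id in scope):
tab <- as.data.frame(table(notes[["ID"]]))
names(tab) <- c("ID", "notes_count")
tab[match(id, tab$ID), ]   # e.g. id = 3:1 keeps the reversed order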

Related

Replacing values based on multiple columns in R

I have raw data with multiple observations, and I have a cleaning log which contains new values for specific columns of the raw data. I want to replace the old values with these new ones.
My raw data is:
raw_df <- data.frame(
  id   = c(1, 2, 3, 4),
  name = c("a", "b", "c", "d"),
  age  = c(15, 16, 20, 22),
  add  = c("xyz", "bc", "no", "da")
)
My cleaning log is:
cleaning_log <- data.frame(
  id        = c(2, 4),
  question  = c("name", "age"),
  old_value = c("b", 22),
  new_value = c("bob", 25)
)
And my expected result is:
result <- data.frame(
  id   = c(1, 2, 3, 4),
  name = c("a", "bob", "c", "d"),
  age  = c(15, 16, 20, 25),
  add  = c("xyz", "bc", "no", "da")
)
Note: at the end, how can I check whether these new values were replaced properly or not?
In addition, the cleaning log's question column may refer to many more columns (perhaps 10 to 20) with new values; here I just give two column names as an example.
Thanks in advance for your help.
Find the row and column numbers to change in raw_df using match, then replace those cells with cleaning_log$new_value.
row_inds <- match(cleaning_log$id, raw_df$id)
col_inds <- match(cleaning_log$question, names(raw_df))
raw_df[cbind(row_inds, col_inds)] <- cleaning_log$new_value
raw_df
# id name age add
#1 1 a 15 xyz
#2 2 bob 16 bc
#3 3 c 20 no
#4 4 d 25 da
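To address the verification note in the question, one hedged check is to read the updated cells back with the same index matrix and compare them to the log (a sketch; new_value is a character vector here, so the comparison coerces types):
# all TRUE means every logged replacement landed in the right cell
all(raw_df[cbind(row_inds, col_inds)] == cleaning_log$new_value)
# [1] TRUE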

merging multiple dataframes with duplicate rows in R

I'm relatively new to R for this kind of thing; I searched quite a bit and couldn't find much that was helpful.
I have about 150 .csv files with 40,000 - 60,000 rows each and I am trying to merge 3 columns from each into 1 large data frame. I have a small script that extracts the 3 columns of interest ("id", "name" and "value") from each file and merges by "id" and "name" with the larger data frame "MergedData". Here is my code (I'm sure this is a very inefficient way of doing this and that's ok with me for now, but of course I'm open to better options!):
file_list <- list.files()
for (file in file_list){
  if(!exists("MergedData")){
    MergedData <- read.csv(file, skip=5)[ , c("id", "name", "value")]
    colnames(MergedData) <- c("id", "name", file)
  } else if(exists("MergedData")){
    temp_data <- read.csv(file, skip=5)[ , c("id", "name", "value")]
    colnames(temp_data) <- c("id", "name", file)
    MergedData <- merge(MergedData, temp_data, by=c("id", "name"), all=TRUE)
    rm(temp_data)
  }
}
Not every file has the same number of rows, though many rows are common to many files. I don't have an inclusive list of rows, so I included all=TRUE to append new rows that don't yet exist in the MergedData file.
My problem is: many of the files contain 2-4 rows with identical "id" and "name" entries, but different "value" entries. So, when I merge them I end up adding rows for every possible combination, which gets out of hand fast. Most frustrating is that none of these duplicates are of any interest to me whatsoever. Is there a simple way to take the value for the first entry and just ignore any further duplicate entries?
Thanks!
Based on your comment, we could stack each file and then cast the resulting data frame from "long" to "wide" format:
library(dplyr)
library(readr)
library(reshape2)
df = lapply(file_list, function(file) {
  dat = read_csv(file)
  dat$source.file = file
  return(dat)
})
df = bind_rows(df)
df = dcast(df, id + name ~ source.file, value.var="value")
In the code above, after reading in each file, we add a new column source.file containing the file name (or a modified version thereof).* Then we use dcast to cast the data frame from "long" to "wide" format to create a separate column for the value from each file, with each new column taking one of the names we just created in source.file.
Note also that depending on what you're planning to do with this data frame, you may find it more convenient to keep it in long format (i.e., skip the dcast step) for further analysis.
Addendum: Dealing with the "Aggregation function missing: defaulting to length" warning. This happens when you have more than one row with the same id, name and source.file. That means there are multiple values that have to get mapped to the same cell, resulting in aggregation. The default aggregation function is length (i.e., a count of the number of values in that cell). The only ways around this that I know of are (a) keep the data in long format, (b) use a different aggregation function (e.g., mean), or (c) add an extra counter column to differentiate cases with multiple values for the same combination of id, name, and source.file. We demonstrate these below.
First, let's create some fake data:
df = data.frame(id=rep(1:2,2),
                name=rep(c("A","B"), 2),
                source.file=rep(c("001","002"), each=2),
                value=11:14)
df
id name source.file value
1 1 A 001 11
2 2 B 001 12
3 1 A 002 13
4 2 B 002 14
Only one value per combination of id, name and source.file, so dcast works as desired.
dcast(df, id + name ~ source.file, value.var="value")
id name 001 002
1 1 A 11 13
2 2 B 12 14
Add an additional row with the same id, name and source.file. Since there are now two values getting mapped to a single cell, dcast must aggregate. The default aggregation function is to provide a count of the number of values.
df = rbind(df, data.frame(id=1, name="A", source.file="002", value=50))
dcast(df, id + name ~ source.file, value.var="value")
Aggregation function missing: defaulting to length
id name 001 002
1 1 A 1 2
2 2 B 1 1
Instead, use mean as the aggregation function.
dcast(df, id + name ~ source.file, value.var="value", fun.aggregate=mean)
id name 001 002
1 1 A 11 31.5
2 2 B 12 14.0
Add a new counter column to differentiate cases where there are multiple rows with the same id, name and source.file and include that in dcast. This gets us back to a single value per cell, but at the expense of having more than one column for some source.files.
# Add counter column
df = df %>% group_by(id, name, source.file) %>%
  mutate(counter = 1:n())
As you can see, counter has a value of 1 where there is only one row for a given combination of id, name, and source.file, but values of 1 and 2 for the one case where there are two rows with the same id, name, and source.file (rows 3 and 5 below).
df
id name source.file value counter
1 1 A 001 11 1
2 2 B 001 12 1
3 1 A 002 13 1
4 2 B 002 14 1
5 1 A 002 50 2
Now we dcast with counter included, so we get two columns for source.file "002".
dcast(df, id + name ~ source.file + counter, value.var="value")
id name 001_1 002_1 002_2
1 1 A 11 13 50
2 2 B 12 14 NA
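Since the original question asks to keep the first value and simply ignore later duplicates, a hedged variant (using the same df with the duplicate row as above; columns outside the formula, such as counter, are ignored) is to pass a first-element aggregation function:
dcast(df, id + name ~ source.file, value.var="value",
      fun.aggregate=function(x) x[1])
id name 001 002
1 1 A 11 13
2 2 B 12 14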
* I'm not sure what your file names look like, so you'll probably need to adjust this to create a naming format with a unique file identifier. For example, if your file names follow the pattern "file001.csv", "file002.csv", etc., you could do this: dat$source.file = paste0("Value", gsub("file([0-9]{3})\\.csv", "\\1", file)).

Set up column names in a new data frame based on variable

My goal is to be able to allocate column names to a data frame that I create based on a passed variable. For instance:
i='column1'
data.frame(i=1)
i
1 1
Above the column name is 'i' when I want it to be 'column1'. I know the following works but isn't as efficient as I'd like:
i <- 'column1'
df <- data.frame(x=1)
setnames(df, i)   # setnames() comes from the data.table package
column1
1 1
It's good to learn how base R handles this:
i <- 'column1'
df <- `names<-`(data.frame(1), i)
df
# column1
#1 1
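A closely related base R one-liner, if you prefer an ordinary function call to the replacement-function form, is setNames (a sketch using the same hypothetical column name):
i <- 'column1'
df <- setNames(data.frame(1), i)   # setNames() is in the stats package, attached by default
df
# column1
#1 1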
Aside from the answers posted by other users, I think you may be stuck with the solution you've already presented. If you already have a data frame with the intended number of rows, you can add a new column using brackets:
df <- data.frame('column1'=1)
i <- 'column2'
df[[i]] <- 2
df
column1 column2
1 1 2
If the idea is to get rid of the setNames call, you would probably never do this, but:
i <- 'column1'
data.frame(`attr<-`(list(1), "names", i))
# column1
# 1 1
You can see that the source of data.frame contains the code
x <- list(...)
vnames <- names(x)
so you can manipulate the names attribute directly.
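Building on that observation, a hedged sketch that names the list before data.frame picks it up as vnames, rather than setting the attribute by hand:
i <- 'column1'
data.frame(setNames(list(1), i))   # data.frame() keeps the list's names
# column1
#1 1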
Not exactly sure how you want it to be more efficient, but you could add all the column names at once with colnames after your data frame has been assembled. Here's an example based on yours (with Td defined so the example is self-contained):
Td <- data.frame(a = c(1, 1), b = c(4, 5))   # example data; Td wasn't shown in the post
data.frame(Td)
a b
1 1 4
2 1 5
nam <- c("Test1","Test2")
colnames(Td) <- nam
data.frame(Td)
Test1 Test2
1 1 4
2 1 5
You could simply pass the name of your column and its values as arguments to data.frame, without adding more lines:
df <- data.frame(column1=1)
df
# column1
#1 1

compare values of data frames with different number of rows

I defined the following function, which takes two data frames, DF_TAGS_LIST and DF_epc_list. Each data frame has a single column, and the two have different numbers of rows. I want to search for each value of DF_TAGS_LIST in DF_epc_list and, if it is found, store it in another data frame.
One example of DF_TAGS_LIST:
TAGS_LIST
3036029B539869100000000B
3036029B537663000000002A
3036029B5398694000000009
3036029B539869400000000C
3036029B5398690000000006
3036029B5398692000000007
And one example of DF_epc_list:
EPC
3036029B539869100000000B
3036029B537663000000002A
3036029B5398690000000006
3036029B5398692000000007
3036029B5398691000000006
3036029B5376630000000034
3036029B53986940000000WF
3036029B5398694000000454
3036029B5398690000000234
3036029B53986920000000FG
In this case, I would like a single data frame output with the following values:
FOUND_TAGS
3036029B5398690000000006
3036029B5398692000000007
3036029B539869100000000B
3036029B537663000000002A
My function is:
FOUND_COMPARE_TAGS <- function(DF_TAGS_LIST, DF_epc_list){
  DF_epc_list <- toString(DF_epc_list)
  DF_TAGS_LIST <- toString(DF_TAGS_LIST)
  DF_found_epc_tags <- data.frame(DF_found_epc_tags = intersect(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list))
  setdiff(union(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list), DF_found_epc_tags$DF_found_epc_tags)
  #DF_found_epc_tags <- data.frame(DF_found_epc_tags = DF_TAGS_LIST[unique(na.omit(match(DF_epc_list$DF_epc_list, DF_TAGS_LIST$DF_TAGS_LIST))),])
  return(DF_found_epc_tags)
}
It now returns an empty data frame with two columns. I have only recently started programming in R.
You can use %in% or (as I mentioned in my comment) intersect:
DF_TAGS_LIST[DF_TAGS_LIST$TAGS_LIST %in% DF_epc_list$EPC, , drop = FALSE]
# TAGS_LIST
# 1 3036029B539869100000000B
# 2 3036029B537663000000002A
# 5 3036029B5398690000000006
# 6 3036029B5398692000000007
intersect(DF_TAGS_LIST$TAGS_LIST, DF_epc_list$EPC)
# [1] "3036029B539869100000000B" "3036029B537663000000002A"
# [3] "3036029B5398690000000006" "3036029B5398692000000007"
Another option is to stack the two one-column frames under a common name and keep the rows that appear twice, i.e. in both (assuming, as in the example, that tags are unique within each frame):
FOUND_TAGS <- rbind(setNames(DF_TAGS_LIST, "TAG"),
                    setNames(DF_epc_list, "TAG"))
FOUND_TAGS <- FOUND_TAGS[duplicated(FOUND_TAGS), , drop = FALSE]

Custom function within subset of data, base functions, vector output?

Apologies for a semi 'double post'. I feel I should be able to crack this, but I'm going round in circles. This is on a similar note to my previous, well-answered question:
Within ID, check for matches/differences
test <- data.frame(
  ID = c(rep(1,3), rep(2,4), rep(3,2)),
  DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
  DOV = c("2000-03-05","2000-06-05","2000-09-05",
          "2004-03-05","2004-06-05","2004-09-05","2005-01-05",
          "2006-10-03","2007-02-05")
)
What I want to do is tag the subjects whose first visit (as at DOV) was less than 180 days from their diagnosis (DOD). I have the following using the plyr package:
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 1 1
2 2 0
3 3 1
What I would like is the vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine: it makes a 'lookup' table where I can see which IDs had their first visit within 180 days of their diagnosis, and I could then go through my original test and build an indicator variable from it, but I'd have thought I should be able to do this in one step.
I'd also like to use base R if possible. I had a method with by, but again it only gave one result per ID and returned a list. I have been trying aggregate but keep getting errors like 'by has to be a list' and then 'it's not the same length', and with the formula method of input I'm stumped at cbind(DOV,DOD) ~ ID...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist(                       # to make a vector from the list and coerce logical to integer
  lapply(split(test, test$ID),    # to apply the function within each ID
         function(x) rep(         # to extend a groupwise value across all of the ID's rows
           min(x$DOV - x$DOD) < 180,  # compare the minimum of the set of intervals
           NROW(x))))
# 11 12 13 21 22 23 24 31 32   <- the labels
#  1  1  1  0  0  0  0  1  1   <- the values
I have added stringsAsFactors=FALSE to the data.frame call:
test <- data.frame(ID = c(rep(1,3), rep(2,4), rep(3,2)),
                   DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
                   DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
                           "2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05"),
                   stringsAsFactors = FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
                  1 * (as.numeric(as.Date(test$DOV) - as.Date(test$DOD)) < 180))
test$V1 <- ave(test$V1, test$ID, FUN = max)
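For reference, a hedged one-step variant of the same idea (a sketch assuming the stringsAsFactors=FALSE version of test above): compute the day gaps once and let ave flag each ID group by its minimum gap.
gap <- as.numeric(as.Date(test$DOV) - as.Date(test$DOD))
# each ID's rows get 1 if that ID's earliest visit gap is under 180 days
test$V1 <- ave(gap, test$ID, FUN = function(d) as.integer(min(d) < 180))
test$V1
# [1] 1 1 1 0 0 0 0 1 1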
