R functions that output datasets - r

I am a bit new to R and am trying to use a function to output a dataframe. I have several dataframes that need deduplication. Each record in the data frame has an index variable (RecID) and a patient ID (PatID). If a patient is listed multiple times in the dataframe, I want to keep the record with the largest RecID.
I want to be able to change this data frame:
PatID RecID
1 1
1 2
2 3
3 4
3 5
4 6
Into this dataframe
PatID RecID
1 2
2 3
3 5
4 6
I can use the following code to successfully deduplicate the dataframe.
df <- df[order(df$PatID, -df$RecID),]
df <- df[ !duplicated(df$PatID), ]
I created a function with this code so I can apply my deduplication scheme across multiple data frames easily.
dedupit <- function(x) {
x <- x[order(x$PatID, -x$RecID),]
x <- x[ !duplicated(x$PatID), ]
}
However, when I run dedupit(df), it does not create a new df dataframe with deduplicated records. The function won't output the final dataframe or any of the intermediate dataframes. Is there a way to have functions output dataframes?

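You need to put return(x) (or simply x) at the end of your function so that the result is returned visibly. You also need to assign the result when you call it, e.g. df <- dedupit(df), because an R function works on a copy of its argument and does not modify df in place.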
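As a minimal sketch of that fix, here is the function from the question with an explicit return, applied to the example data from the question. The toy data frame is rebuilt by hand here, so its construction is an assumption about how the real data looks.

```r
dedupit <- function(x) {
  # Sort by patient, then by RecID descending, so the largest RecID comes first
  x <- x[order(x$PatID, -x$RecID), ]
  # Keep only the first row seen for each patient
  x <- x[!duplicated(x$PatID), ]
  # Return the deduplicated data frame to the caller
  return(x)
}

# Example data from the question
df <- data.frame(PatID = c(1, 1, 2, 3, 3, 4), RecID = 1:6)

# The function does not change df in place, so assign the result back
df <- dedupit(df)
df
```

The same function can then be reused on each of the other data frames, assigning each result to whatever name you want to keep.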

Related

Rbind: Add name of source as column and/or variable in the new combined dataset

I'm using rbind to combine multiple datasets into one big dataframe.
For future reference, I want to be able to see which dataset each row originates from.
Is there an easy way to do this without using the IDs or other 'hacks'?
Example of source files:
Sales_East <- read.csv('salesEast.csv')
Sales_West <- read.csv('salesWest.csv')
Dataset <- rbind.fill(Sales_East, Sales_West)
The resulting dataset:
ID Order Amount
1 2 10
2 1 5
A 4 20
B 2 10
But I'm looking for something more like this:
ID Order Amount Source
1 2 10 East
2 1 5 East
A 4 20 West
B 2 10 West
If it's only a couple of dataframes you want to row-bind, just add the source yourself:
library(plyr)  # provides rbind.fill()
Sales_East <- read.csv('salesEast.csv')
Sales_East$Source <- "East"
Sales_West <- read.csv('salesWest.csv')
Sales_West$Source <- "West"
Dataset <- rbind.fill(Sales_East, Sales_West)
If you have a whole bunch of dataframes, you need to get their names in a character vector by either writing it yourself or using ls(). But once you have it, you can do this:
dfnames <- c("Sales_East", "Sales_West")
do.call(rbind, lapply(dfnames, function(x) cbind(get(x), Source=x)))
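A variation on the same idea, as a rough sketch: keep the data frames in a named list and derive the Source label from each name. The two toy data frames stand in for the CSV files (their values are copied from the question), and the assumption that every name starts with "Sales_" is mine, used only to strip the prefix.

```r
# Toy stand-ins for the data read from salesEast.csv and salesWest.csv
Sales_East <- data.frame(ID = c("1", "2"), Order = c(2, 1), Amount = c(10, 5))
Sales_West <- data.frame(ID = c("A", "B"), Order = c(4, 2), Amount = c(20, 10))

# Named list: the names carry the source information
dfs <- list(Sales_East = Sales_East, Sales_West = Sales_West)

# Add a Source column to each data frame, dropping the assumed "Sales_" prefix
tagged <- Map(function(d, nm) cbind(d, Source = sub("^Sales_", "", nm)),
              dfs, names(dfs))

# Row-bind everything and reset the row names
Dataset <- do.call(rbind, tagged)
rownames(Dataset) <- NULL
Dataset
```

Note that plain rbind requires all data frames to have the same columns; if they differ, rbind.fill from plyr can be used in the last step instead.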

Select all the rows in R that have some attributes from other columns R [duplicate]

I want to search my dataset for the rows that have certain attributes in multiple columns.
For that, I found that I can use grep like so:
df <- read.csv('example.csv', header = TRUE, sep='\t')
df[grep("region+druggable", df$locus_type=="region", df$drug_binary==1),]
But when I run this, my output is just the column names, with no rows.
Why is this happening?
my dataframe is like this:
id locus_type drug_binary
1 pseudogene 1
2 unknown 0
3 region 1
4 region 0
5 phenotype_only 1
6 region 1
...
So ideally, I would expect to get the 3rd and 6th row as a result of my query.
If you want to use base R, the correct syntax is the following:
df[grepl("region|druggable",df$locus_type) & df$drug_binary==1,]
Which gives the following output:
id locus_type drug_binary
3 3 region 1
6 6 region 1
Since you want to combine logical vectors, you need to use grepl, which returns a logical vector (grep returns the positions of the matches instead).
I also assumed you wanted to check for a locus type matching either region or druggable; the regex I used above, "region|druggable", expresses that.
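To make the grep/grepl difference concrete, here is a small sketch on the six example rows from the question (the data frame is typed in by hand here rather than read from example.csv). It also shows plain == as an alternative when an exact match on locus_type is what you want; that alternative is my assumption about the intent, not part of the answer above.

```r
# The six example rows from the question
df <- data.frame(
  id = 1:6,
  locus_type = c("pseudogene", "unknown", "region", "region",
                 "phenotype_only", "region"),
  drug_binary = c(1, 0, 1, 0, 1, 1)
)

# grep returns integer positions of the matching elements
grep("region", df$locus_type)

# grepl returns one TRUE/FALSE per element, so it can be combined with &
grepl("region", df$locus_type)

# Exact match on locus_type combined with the drug_binary condition
hits <- df[df$locus_type == "region" & df$drug_binary == 1, ]
hits
```

With == the pattern is not treated as a regular expression, so a value such as "subregion" would not be picked up, whereas grepl("region", ...) would match it.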
I like dplyr for its readability:
library(dplyr)
subdf <- filter(df, locus_type=="region", drug_binary==1)
Sometimes it can be helpful to use the sqldf library.
?sqldf
SQL select on data frames
Description
SQL select on data frames
this is how you could get the result you need:
# load the sqldf library
# if you get error "Error in library(sqldf) : there is no package called sqldf"
# you can install it simply by typing
# install.packages('sqldf') <-- please notice the quotes!
library(sqldf)
# load your input dataframe
# (it is named mydf so that it matches the table name used in the query below)
mydf <- read.csv('/tmp/data.csv', stringsAsFactors = FALSE)
# of course it's a data.frame
class(mydf)
# express your query in SQL terms
sql_statement <- "select * from mydf where locus_type='region' and drug_binary=1"
# create a new data.frame as output of a select statement
# please notice how the "mydf" data.frame automagically becomes a valid sqlite table
output.dataframe <- sqldf(sql_statement)
# the output of a sqldf 'select' statement is a data.frame, too
class(output.dataframe)
# print your output df
output.dataframe
id locus_type drug_binary
3 region 1
6 region 1

R: summing values of matched names and adding on new names' values

I am trying to do a simple task and have created a simple example. I would like to add the counts of a taxon recorded in one vector ('introduced', below) to the counts already measured in another vector ('existing'), matching by taxon name. However, when there is a new taxon (present in introduced but not in existing), I would like this taxon and its count to be added as a new entry in the result (the order doesn't matter, but the name needs to be retained).
For example:
existing<-c(3,4,5,6)
names(existing)<-c("Tax1","Tax2","Tax3","Tax4")
introduced<-c(2,2)
names(introduced)<-c("Tax1","Tax5")
I want the new vector, called "combined" here, to look like this:
#names(combined)= c("Tax1","Tax2","Tax3","Tax4","Tax5")
#combined= c(5,4,5,6,2)
The main thing to see is that the values for "Tax1" are combined (3+2=5) and "Tax5" (2) is added on to the end.
I have looked around but previous answers similar to this have much more complex data and it is difficult to extract which function I need. I have been trying combinations of match and which, but just cannot get it right.
grp <- c(existing,introduced)
tapply(grp,names(grp),sum)
#Tax1 Tax2 Tax3 Tax4 Tax5
# 5 4 5 6 2
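For reference, a self-contained sketch of the tapply approach above, with one extra step that is my own addition: tapply returns a one-dimensional array, so if a plain named vector is wanted (as in the question), the array attributes can be dropped afterwards.

```r
existing <- c(Tax1 = 3, Tax2 = 4, Tax3 = 5, Tax4 = 6)
introduced <- c(Tax1 = 2, Tax5 = 2)

# Concatenate both vectors; names are kept, so "Tax1" appears twice
grp <- c(existing, introduced)

# Sum the values within each name
combined <- tapply(grp, names(grp), sum)

# Optional: convert the 1-d array into a plain named numeric vector
combined <- setNames(as.vector(combined), names(combined))
combined
```

The result is ordered by name rather than by first appearance, which is fine here since the question says the order does not matter.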
Instead of keeping your data in 'loose' vectors, you may consider collecting them in one data frame. First, put your two sets of vector data in data frames:
existing <- c(3, 4, 5, 6)
taxon <- c("Tax1", "Tax2", "Tax3", "Tax4")
df1 <- data.frame(existing, taxon)
introduced <- c(2, 2)
taxon <- c("Tax1", "Tax5")
df2 <- data.frame(introduced, taxon)
Then merge the two data frames by the common column, 'taxon'. Set all = TRUE to include all rows from both data frames:
df3 <- merge(df1, df2, all = TRUE)
Finally, sum the 'existing' and 'introduced' counts for each taxon, and add the result to the data frame:
df3$combined <- rowSums(df3[ , c("existing", "introduced")], na.rm = TRUE)
df3
# taxon existing introduced combined
# 1 Tax1 3 2 5
# 2 Tax2 4 NA 4
# 3 Tax3 5 NA 5
# 4 Tax4 6 NA 6
# 5 Tax5 NA 2 2

How to cross-tabulate two variables in R?

This seems to be basic, but I can't get it to work. I am trying to compute the frequency table in R for the data below:
1 2
2 1
3 1
I want to export the two-way frequencies as CSV output, whose rows will be all the unique entries in column A of the data and whose columns will be all the unique entries in column B of the data, and whose cell values will be the number of times each combination has occurred. I have explored some constructs like table, but I am not able to output the values correctly in CSV format.
Output of sample data:
"","1","2"
"1",0,1
"2",1,0
"3",1,0
The data:
df <- read.table(text = "1 2
2 1
3 1")
Calculate frequencies using table:
(If your object is a matrix, you could convert it to a data frame using as.data.frame before using table.)
tab <- table(df)
V2
V1 1 2
1 0 1
2 1 0
3 1 0
Write data with the function write.csv:
write.csv(tab, "tab.csv")
The resulting file:
"","1","2"
"1",0,1
"2",1,0
"3",1,0
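If you prefer to make the wide layout explicit before writing, one possible sketch is to convert the table with as.data.frame.matrix first. This is an alternative I am suggesting, not part of the answer above, and the temporary file is used here only so the example does not leave a file behind.

```r
df <- read.table(text = "1 2
2 1
3 1")

# Two-way frequency table: rows are values of V1, columns are values of V2
tab <- table(df)

# Convert to a data frame that keeps the wide (rows x columns) layout
wide <- as.data.frame.matrix(tab)

# Write to a temporary file and read the raw lines back to inspect them
out <- tempfile(fileext = ".csv")
write.csv(wide, out)
lines <- readLines(out)
lines
```

By contrast, calling as.data.frame(tab) without the .matrix suffix gives the long format, with one row per combination and a Freq column, which is not the layout asked for in the question.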

efficient string value count in large data.frame

I have a large dataframe (~ 600K rows) with a string-value column (link)
doc_id,link
1,http://example.com
1,http://example.com
2,http://test1.net
2,http://test2.net
2,http://test5.net
3,http://test1.net
3,http://example.com
4,http://test5.net
and I would like to count the number of times a certain string value occurs in the frame. The result should look like this:
link, count
http://example.com, 3
http://test1.net, 2
http://test2.net, 1
http://test5.net, 2
Is there an efficient way to do this in R? Converting the frame into a matrix doesn't work because of the frame size. Currently I am using the plyr package, but this is too slow.
The table function counts occurrences - and it's very fast compared to ddply. So, something like this perhaps:
# some sample data
set.seed(42)
df <- data.frame(doc_id=1:10, link=sample(letters[1:3], 10, replace=TRUE))
cnt <- as.data.frame(table(df$link))
# Assign appropriate names (optional)
names(cnt) <- c("link", "count")
cnt
Which gives the following output:
link count
1 a 2
2 b 3
3 c 5
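The same approach applied to the sample rows from the question, as a small sketch. The data frame is typed in by hand here, and stringsAsFactors = FALSE is my choice so that the link column stays character in the result.

```r
# Sample rows from the question
df <- data.frame(
  doc_id = c(1, 1, 2, 2, 2, 3, 3, 4),
  link = c("http://example.com", "http://example.com", "http://test1.net",
           "http://test2.net", "http://test5.net", "http://test1.net",
           "http://example.com", "http://test5.net"),
  stringsAsFactors = FALSE
)

# Count occurrences of each link and turn the table into a data frame
cnt <- as.data.frame(table(df$link), stringsAsFactors = FALSE)

# Assign the column names asked for in the question
names(cnt) <- c("link", "count")
cnt
```

If the most frequent links should come first, the rows can be reordered afterwards with cnt[order(-cnt$count), ].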
