Bind Tables into one data frame

I have tried a couple of options, generally trying out various combinations of cbind, to accomplish this. Essentially I would like to combine different pivot tables into one data frame in order to export to csv/excel. Is there a better way to accomplish this?
EDIT: Essentially I am trying to learn the basics of creating a function that can wrap around multiple different pivot tables to create a data frame ready for export, serving as a template for ad hoc reporting. The problem I am having is that the cbind product takes object b, which as a standalone would be a table with the dates as columns, and forces it into a long table where the dates are transposed into rows.
dataframe:
State FacilityName Date
NY    Loew         June 2014
NY    Loew         June 2014
CA    Sunrise      May 2014
CA    NA           May 2014
code:
volume <- function() {
  df$missing = ifelse(is.na(df$FacilityName), "Missing", df$FacilityName)
  df = subset(df, df$missing == "Missing")
  x <- function(){
    a <- as.data.frame(table(df$FacilityName))
    b <- table(df$FacilityName, df$date)
    cbind(a, b[,1], b[2])
  }
}

When you give a factor to the table function, it uses the levels of the factor to build the table. So there's a nice way to obtain what you want by adding "Missing" to the levels of "FacilityName".
# loading data
ec <- read.csv(text=
'State,FacilityName,Date
NY,Loew,June 2014
NY,Loew,June 2014
CA,Sunrise,May 2014
CA,NA,May 2014', stringsAsFactors=TRUE)
# stringsAsFactors=TRUE is needed in R >= 4.0 for the levels trick below
# Adding Missing to the possible levels of FacilityName
# note that we add it in front
new.levels <- c("Missing", levels(ec$FacilityName))
ec$FacilityName <- factor(ec$FacilityName, levels=new.levels)
# And replacing NAs by the new level "Missing"
ec$FacilityName[is.na(ec$FacilityName)] <- "Missing"
# the previous line would not have worked
# if we had not added "Missing" explicitly to the levels
# table() uses the levels to generate the table
# the levels are displayed in order
# now there's a level "Missing" in first position
t <- table(ec$FacilityName, ec$Date)
You get:
> t
        June 2014 May 2014
Missing         0        1
Loew            2        0
Sunrise         0        1
You can add the total line like this (I don't think your code with nrow does what you say it does):
# adding total line
rbind(t, TOTAL=colSums(as.matrix(t)))
        June 2014 May 2014
Missing         0        1
Loew            2        0
Sunrise         0        1
TOTAL           2        2
At this point you have a matrix so you may want to pass it to as.data.frame.
This can easily be implemented as a separate function if you want to. No need to bind several tables after all :)
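For instance, a minimal sketch of such a wrapper (the name facility_table is made up, and it assumes FacilityName was read in as a factor, as above):
facility_table <- function(dat) {
  # add "Missing" in front of the existing levels and recode the NAs
  dat$FacilityName <- factor(dat$FacilityName,
                             levels = c("Missing", levels(dat$FacilityName)))
  dat$FacilityName[is.na(dat$FacilityName)] <- "Missing"
  tab <- table(dat$FacilityName, dat$Date)
  tab <- rbind(tab, TOTAL = colSums(as.matrix(tab)))
  as.data.frame(tab)  # rbind() yields a matrix, so this keeps the wide layout
}
facility_table(ec)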

OK, so it seems I was trying to be cool and use a function to wrap everything, in the hopes that it would be the beginning of learning to write flexible code. But I did it the long way and ended up getting the result I wanted. While I will post the code that worked below, I am very interested in someone pointing me towards a better way to approach these kinds of problems, in order to learn better coding.
# Label the empty cells as Missing
ec$missing = ifelse(is.na(ec$FacilityName), "Missing", ec$FacilityName)
# Subset the dataframe to just missing values
df = subset(ec, ec$missing == "Missing")
# Create table that is a row of frequency by month for missing values
a <- table(df$missing, df$Date)
# Reload dataframe to exclude Missing values
df = subset(ec, ec$missing != "Missing")
# Create table that shows frequency of observations for each facility by Month
b <- table(df$FacilityName, df$Date)
# Create a Total row that can go at the bottom of the final data frame
Total <- nrow(ec)
# Bind all three objects
rbind(a,b,Total)
Here is an example of the final product I was looking for:
        May2014 June2014
Missing       2        0
Sunrise       0        0
Loew          1        2
Total         3        2
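As an aside, base R's addmargins() gets the same totals row without building it by hand; applied to the table t from the answer above:
# margin = 1 appends a "Sum" row over the first dimension (rows)
addmargins(t, margin = 1)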

Related

Delete missing datapoints (NA's) from multiple vectors

So I am working with biological data at a hospital (I won't disclose anything here, but I won't need to in order to ask this question). We are looking at concentrations of antibodies measured over a certain amount of time. There are, for one reason or another, missing data points all over our data set. What I am doing is trying to remove the missing data points along with their corresponding times. Right now the basic goal is just to get some basic graphs and charts up and running, but eventually we're going to want to create some logistic models and nonlinear dynamics models, which we'll do in another language.
1) First I put my data into a vector along with its corresponding time:
data <- read.csv("blablabla.csv", header = TRUE)
Biomarker <- data[,2]
time <- data[,1]
2) Then I sort the data:
Biomarker <- Biomarker[order(time)]
time <- sort(time, decreasing = F)
3) Then I put the indexes of the NA values into a vector:
NA_Index <- which(is.na(Biomarker))
4) Then I try to remove the data points at that index for both the biomarker and time vector:
i <- 1
n <- length(NA_Index)
for (i in 1:n) {
  Biomarker[[NA_Index[i]]] <- NULL
  time[[NA_Index[i]]] <- NULL
}
Also I have tried a few things different from the one above:
1)
Biomarker <- Biomarker[-NA_Index[i]]
2)
Biomarker <- Biomarker[!= "NA"]
My question is: "How do I remove NA values from my vectors and remove the time with the same index?"
So obviously I am very new to R and might be going about this in a completely wrong way. I just ask that you explain what the functions do if you post some code. Thanks for the help.
First I'd recommend storing your data in a data.frame instead of two vectors; since the entries in the vectors correspond to cases, this is a more appropriate data structure.
my_table <- data.frame(time=time, Biomarker=Biomarker)
Then you can simply subset the whole data.frame: the first dimension is rows, the second is columns, as usual. Leave the second dimension empty to keep all columns.
my_table <- my_table[!is.na(my_table$Biomarker), ]
> BioMarker
[1] 1 2 NA 3 NA 5
> is.na(BioMarker)
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> BioMarker[is.na(BioMarker)]
[1] NA NA
> BioMarker[! is.na(BioMarker)]
[1] 1 2 3 5
> BioMarker <- BioMarker[! is.na(BioMarker)]
> BioMarker
[1] 1 2 3 5
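Putting the pieces together for the original two-vector setup, a minimal sketch (the file name comes from the question; the column positions are assumed from step 1):
data <- read.csv("blablabla.csv", header = TRUE)
my_table <- data.frame(time = data[, 1], Biomarker = data[, 2])
my_table <- my_table[order(my_table$time), ]        # sort by time (step 2)
my_table <- my_table[!is.na(my_table$Biomarker), ]  # drop rows with a missing Biomarker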

Is there a way to set character values at the top of an R program such that you don't have to make multiple changes at each run

I'm working on a rather lengthy shared R program which processes client data and references things like the name of the time variables supplied by each client (which obviously changes at almost every client submission).
What I wanted to do is to set the name of (say) a timeseries variable to WEEK and be able to reference timeseries throughout the code so that I only need to change the one section of code right at the top:
TOP OF CODE
timeseries <- "WEEK"
EXAMPLE MID CODE
summary_transposed_no_time = summary_transposed_no_missing
summary_transposed_no_time$timeseries <- NULL
I have found that this approach does work for things like sqldf steps, as the below is working just fine. Ideally I want to use this approach across both R logic and SQL logic, as the program is very lengthy and a lot of it is written in SQL, which I would love to avoid re-writing:
dataset <- "client_a_data"
response <- "SALE"
timeseries <- "WEEK"
region <- "POSTAL_DIST"
summary <- sqldf(paste("SELECT", timeseries,
                       ",", region,
                       ",sum(", response, ") AS", response,
                       "FROM", "dataset",
                       "GROUP BY", timeseries, ",", region,
                       "ORDER BY", timeseries, ",", region))
I think I see what you're trying to achieve, but let me know if I'm off track...
One way I can see to do this would be to build a search for the appropriate column early in your script, and use the returned value from then on to refer to the column.
df <- data.frame( data = rnorm( 20, 1, 1 ), day = seq_len( 20 ) )
df$week <- ((df$day - 1) %/% 7) + 1
Now we can specify your timeseries variable as any of the columns in the frame:
timeseries <- "week"
Then, somewhere in our script, have something like this to extract a reference for the column:
timeColumn <- match( timeseries, names( df ) )
Which now allows you to refer to that column as many times as you like in your script:
df[, timeColumn]
Any time you change that "week" value to, say "day", the rest of your script will now change to refer to that instead.
Just a note: if you do go this route, be careful either not to move columns around (which would make your reference value stop working correctly) or to re-run the match call each time you want to refer to the column (which would allow you to move columns around if you need to).
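If you prefer the second, safer option, a small sketch of re-running the match at use time (the helper name getTimeColumn is made up):
# resolve the column position on every call, so reordering columns is harmless
getTimeColumn <- function(d) d[, match(timeseries, names(d))]
getTimeColumn(df)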
You can refer to any column by name directly. Variables response, timeseries and region are as defined in the question.
# generate some data
client_a_data <- data.frame(SALE=100:104, WEEK=1:5, POSTAL_DIST=60000:60004)
# read in data
dataset <- ... # whatever code you use to upload the client_a_data
# here:
dataset <- "client_a_data"
dataset <- get(dataset)
dataset
  SALE WEEK POSTAL_DIST
1  100    1       60000
2  101    2       60001
3  102    3       60002
4  103    4       60003
5  104    5       60004
# refer to any column by its pre-defined name
dataset[, timeseries]
[1] 1 2 3 4 5
dataset[, c(response, region)]
  SALE POSTAL_DIST
1  100       60000
2  101       60001
3  102       60002
4  103       60003
5  104       60004
So your specific line that would delete the WEEK column should read:
summary_transposed_no_time[, timeseries] <- NULL
Or you might wish to rename the pertaining columns at the beginning of your code to whatever text appears throughout.
colnames(dataset)[match(c(timeseries, response, region), colnames(dataset))] <-
  c("timeseries", "response", "region")
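With the example data above, that rename leaves the rest of the script free to use the fixed names:
colnames(dataset)
[1] "response"   "timeseries" "region"
dataset$timeseries
[1] 1 2 3 4 5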

Data handling: 2 independent factors, which decide the position of a numeric value in a new data frame

I am new to Stackoverflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I ran into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
     1        A     1
     1        B     2
     2        A     3
     3        A     2
     3        B     3
     3        C     1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
   A  B  C
1  1  2 NA
2  3 NA NA
3  2  3  1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
  assign(detectors[i], DF[which(DF$Detector == detectors[i]), ])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem: I have to get the values from the detector subsets into the result dataframe. Here it is important that each value finds its way to the right position in the dataframe. The issue is that there are not equally many values, since some samples lack some detectors.
I tried the following: iterate through the detector subsets, compare the row name (= sample name) with each other, and if they are the same, write the value into the new dataframe. If they are not the same, it should write an NA.
for (i in 1:length(detectors)){
  for (j in 1:length(get(detectors[i])$Sample)){
    result[j, i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j, ]),
                          get(detectors[i])$Ct.Mean[j], NA)
  }
}
The trouble is that this stops the iteration through the detector$Sample column and switches to the next detector. My understanding is that the compared samples get out of sync, so all following ifelse calls yield NA.
I tried to circumvent this somehow by editing the no branch of ifelse(test, yes, no) with j = j + 1 to get it back in sync, but this unfortunately didn't work.
I hope I could make my problem understandable to you!
Looking forward to hearing any suggestions or comments (also on how to generally improve my code ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
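A minimal sketch of that dcast variant (same formula and value.var):
library(reshape2)
dcast(DF, Sample ~ Detector, value.var = 'Value')
#  Sample A  B  C
#1      1 1  2 NA
#2      2 3 NA NA
#3      3 2  3  1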
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
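Note that in current tidyr, spread() is superseded by pivot_wider(); an equivalent call would be:
pivot_wider(DF, names_from = Detector, values_from = Value)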

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example (cells between events are empty):
LOG_MESSAGE
FIRST_EVENT

SECOND_EVENT


Now, I'd like to take that and turn it into this:
LOG_MESSAGE    CURRENT_EVENT
FIRST_EVENT    FIRST_EVENT
               FIRST_EVENT
SECOND_EVENT   SECOND_EVENT
               SECOND_EVENT
               SECOND_EVENT
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),  # list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
  ID sample_value  log_message Current_Event
1  1           34  FIRST_EVENT         FIRST
2  2           56         <NA>         FIRST
3  3           78 SECOND_EVENT        SECOND
4  4           98         <NA>        SECOND
5  5          234         <NA>        SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character string.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors,
4. so I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
5. The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
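For reference, the same last-observation-carried-forward fill can also be done with tidyr's fill(), assuming the NA-coded dat from above:
library(tidyr)
dat <- fill(dat, log_message)  # carries the last non-NA log_message down the column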

Deleting multiple columns in different data sets in R

I'm wondering if there is a good way to delete multiple columns over a few different data sets in R. I have a data set that looks like:
RangeNumber Time Value Quality Approval
          1 2:00     1       1        1
          2 2:05     4       2        1
And I want to delete everything but the Time and Value columns in my data sets. I'm "deleting" them by setting each column to NULL, e.g.: data1$RangeNumber <- NULL.
I'm going to have upwards of 16 or more data sets with identical column setups, and the data sets are going to be numbered in incremental order, e.g.: data1, data2, data3, etc.
I'm wondering if a for loop that iterates through all of the data set columns is the best way to accomplish this, or -- since I have read that R is slow at for loops -- if there is an easier way to do this. I'm also wondering if I need to combine all of my data sets into one variable and then iterate through it to remove the columns.
If a for loop is the best way to go, how would I set it up?
You want to gather those dataframes into a list and then run the Extract function "[" over them. The first argument given to "[" should be TRUE so that all rows are obtained, and the second argument should be the column names. (I made up three dataframes that varied in their row numbers and column names, but all had 'Time' and 'Value' columns.)
> datlist <- list(dat1, dat2, dat3)
> TimVal <- lapply(datlist, "[", TRUE, c("Time", "Value"))
> TimVal
[[1]]
  Time Value
1 2:00     1
2 2:05     4

[[2]]
  Time Value
1 2:00     1
2 2:05     4

[[3]]
    Time Value
1   2:00     1
2   2:05     4
2.1 2:05     4
1.1 2:00     1
This is added in case the goal was to have them all together in the same dataframe:
> do.call(rbind, TimVal)
    Time Value
1   2:00     1
2   2:05     4
3   2:00     1
4   2:05     4
11  2:00     1
21  2:05     4
2.1 2:05     4
1.1 2:00     1
If you are very new to R, you may not have figured out that the last code did not change TimVal; it only showed what value would be returned. To make the effect durable you would need to assign the result to a name, perhaps even the same name:
TimVal <- do.call(rbind, TimVal)
Rather than delete, just choose the columns that you want, i.e.
data1 = data1[, c(2, 3)]
The question still remains about your other data sets: data2, etc. I suspect that since your data frames are all "similar", you could combine them into a single data frame with an additional identifier column, id, which tells you the data set number. How you combine your data sets depends on how your data is stored, but typically a for loop over read.csv is the way to go, as sketched below.
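A sketch of that loop, assuming the data sets come from files named data1.csv through data16.csv (the file names are made up for illustration):
all_data <- list()
for (i in 1:16) {
  d <- read.csv(paste0("data", i, ".csv"))
  d <- d[, c("Time", "Value")]  # keep only the columns of interest
  d$id <- i                     # identifier: which data set each row came from
  all_data[[i]] <- d
}
combined <- do.call(rbind, all_data)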
I'm not sure if I should recommend these since these are pretty "destructive" methods.... Be sure that you have a backup of your original data before trying ;-)
This approach assumes that the datasets are already in your workspace and you just want new versions of them.
Both of these are pretty much the same. One option uses lapply() and the other uses for.
lapply
lapply(ls(pattern = "data[0-9]+"),
       function(x) { assign(x, get(x)[2:3], envir = .GlobalEnv) })
for
temp <- ls(pattern = "data[0-9]+")
for (i in 1:length(temp)) {
  assign(temp[i], get(temp[i])[2:3])
}
Basically, ls(pattern = ...) will create a vector of the names of the data sets in your workspace that match the naming pattern you provide. Then, you write a small function to select the columns you want to keep.
A less "destructive" approach would be to create new data.frames instead of overwriting the original ones. Something like this should do the trick:
lapply(ls(pattern = "data[0-9]+"),
       function(x) { assign(paste(x, "T", sep="."),
                            get(x)[2:3], envir = .GlobalEnv) })
