I'm wondering if there is a good way to delete multiple columns over a few different data sets in R. I have a data set that looks like:
RangeNumber Time Value Quality Approval
1 2:00 1 1 1
2 2:05 4 2 1
And I want to delete everything but the Time and Value columns in my data sets. I'm "deleting" them by setting each column to NULL, e.g. data1$RangeNumber <- NULL.
I'm going to have 16 or more data sets with identical column setups, and the data sets are going to be numbered in incremental order, e.g. data1, data2, data3, etc.
I'm wondering if a for loop that iterates through all of the data set columns is the best way to accomplish this, or, since I have read that R is slow at for loops, if there is an easier way to do this. I'm also wondering if I need to combine all of my data sets into one variable, and then iterate through to remove the columns.
If a for loop is the best way to go, how would I set it up?
You want to gather those data frames into a list and then apply the extraction function "[" to each element with lapply. The first argument given to "[" should be TRUE so that all rows are kept, and the second argument should be the column names. (I made up three data frames that varied in their row numbers and column names but all had 'Time' and 'Value' columns.)
> datlist <- list(dat1,dat2,dat3)
> TimVal <- lapply(datlist, "[", TRUE, c("Time","Value") )
> TimVal
[[1]]
Time Value
1 2:00 1
2 2:05 4
[[2]]
Time Value
1 2:00 1
2 2:05 4
[[3]]
Time Value
1 2:00 1
2 2:05 4
2.1 2:05 4
1.1 2:00 1
This is added in case the goal was to have them all together in the same dataframe:
> do.call(rbind, TimVal)
Time Value
1 2:00 1
2 2:05 4
3 2:00 1
4 2:05 4
11 2:00 1
21 2:05 4
2.1 2:05 4
1.1 2:00 1
If you are very new to R, you may not have realized that the last call did not change TimVal; it only showed the value that would be returned. To make the effect durable you need to assign the result to a name, perhaps even the same name:
TimVal <- do.call(rbind, TimVal)
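If the data frames already exist in the workspace under numbered names, the list does not have to be typed out by hand. The following is a minimal sketch, assuming three made-up data frames named data1 to data3 with the column layout from the question (the values are invented for illustration). mget() collects the objects by name and lapply() keeps only the wanted columns.

```r
# Made-up stand-ins for the question's data sets (assumed names data1..data3)
data1 <- data.frame(RangeNumber = 1:2, Time = c("2:00", "2:05"),
                    Value = c(1, 4), Quality = c(1, 2), Approval = c(1, 1))
data2 <- data1
data3 <- data1[c(1, 2, 2), ]

# Collect the numbered data frames into a named list
datlist <- mget(paste0("data", 1:3))

# Keep all rows and only the Time and Value columns of each element
TimVal <- lapply(datlist, function(d) d[, c("Time", "Value")])

# Optionally stack everything into one data frame
stacked <- do.call(rbind, TimVal)
str(stacked)
```

Selecting by column name rather than by position keeps the code working even if the column order differs between data sets.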
Rather than delete, just choose the columns that you want, i.e.
data1 = data1[, c(2, 3)]
The question still remains about your other data sets: data2, etc. I suspect that since your data frames are all "similar", you could combine them into a single data frame with an additional identifier column, id, which tells you the data set number. How you combine your data sets depends on how your data are stored, but typically a for loop over read.csv is the way to go.
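As a rough sketch of that idea, the code below first writes two small CSV files to a temporary folder (the file names data1.csv and data2.csv and their contents are assumptions made up for this example), then reads them back, keeps Time and Value, and adds the id column. It uses lapply() rather than an explicit for loop, which amounts to the same iteration.

```r
# Hypothetical setup: write two small CSV files to a temporary folder
csv_dir <- tempfile("datasets")
dir.create(csv_dir)
for (i in 1:2) {
  d <- data.frame(RangeNumber = 1:2, Time = c("2:00", "2:05"),
                  Value = c(1, 4) * i, Quality = c(1, 2), Approval = c(1, 1))
  write.csv(d, file.path(csv_dir, paste0("data", i, ".csv")), row.names = FALSE)
}

files <- list.files(csv_dir, pattern = "^data[0-9]+\\.csv$", full.names = TRUE)

# Read each file, keep Time and Value, and record the data set number in id
pieces <- lapply(files, function(f) {
  d <- read.csv(f)[, c("Time", "Value")]
  d$id <- as.integer(gsub("[^0-9]", "", basename(f)))
  d
})
combined <- do.call(rbind, pieces)
combined
```

Taking the id from the file name rather than from the loop counter avoids depending on the order in which list.files() returns the files.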
I'm not sure if I should recommend these since these are pretty "destructive" methods.... Be sure that you have a backup of your original data before trying ;-)
This approach assumes that the datasets are already in your workspace and you just want new versions of them.
Both of these are pretty much the same. One option uses lapply() and the other uses for.
lapply
# "^data[0-9]+$" matches object names such as data1, data2, data16
lapply(ls(pattern = "^data[0-9]+$"),
       function(x) { assign(x, get(x)[2:3], envir = .GlobalEnv) })
for
temp <- ls(pattern = "^data[0-9]+$")
for (i in seq_along(temp)) {
  assign(temp[i], get(temp[i])[2:3])
}
Basically, ls(pattern = ...) creates a character vector with the names of the datasets in your workspace that match the naming pattern you provide. Then you write a small function that selects the columns you want to keep.
A less "destructive" approach would be to create new data.frames instead of overwriting the original ones. Something like this should do the trick:
lapply(ls(pattern = "^data[0-9]+$"),
       function(x) { assign(paste(x, "T", sep="."),
                            get(x)[2:3], envir = .GlobalEnv) })
I have 20 excel files containing city level data for each year. I imported them in a list because I thought it will be easier to loop over them.
The first task that I wanted to do is to change the name of the second column of each file.
If, for a single file I do:
#data is a list of data tables/frames. Example:
data<-list(a = data.frame(1:2,3:4),b = data.frame(5:8,15:18) )
#renaming the second column of a (works)
names(data[[1]])[2]<-"ABC"
I am able to rename the column.
To do batch editing I wanted to write a function to be used in lapply. The function should be a simple version of the above thing:
rename <-function(df){
names(df)[2]<-"XYZ"}
rename(data[[1]]), however, does nothing to the second column. Any ideas why?
You need to return the full modified object at each iteration:
data <- lapply( data, function(x) {names(x)[2]<-"ABC"; x})
data
#---------
[[1]]
X1.2 ABC
1 1 3
2 2 4
[[2]]
X5.8 ABC
1 5 15
2 6 16
3 7 17
4 8 18
I'm sure this is a duplicate but I don't know what the right search terms might be, so I'm just answering it .... again.
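To make the reason concrete: inside a function, df is a local copy, and the value of the last expression, an assignment, is just the assigned string, returned invisibly. The sketch below contrasts the two versions; the names rename_broken and rename_second are made up for this illustration.

```r
data <- list(a = data.frame(1:2, 3:4), b = data.frame(5:8, 15:18))

# The last expression is an assignment, so the function returns "XYZ"
# (invisibly) and the modified local copy of df is thrown away
rename_broken <- function(df) {
  names(df)[2] <- "XYZ"
}
broken_result <- rename_broken(data[[1]])

# Returning df explicitly hands the modified copy back to the caller
rename_second <- function(df, new_name = "XYZ") {
  names(df)[2] <- new_name
  df
}

# The result still has to be assigned, since data itself is never modified
data <- lapply(data, rename_second)
second_names <- sapply(data, function(df) names(df)[2])
second_names
```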
I'm working on a rather lengthy shared R program which processes client data and references things like the name of the time variables supplied by each client (which obviously changes at almost every client submission).
What I wanted to do is to set the name of (say) a timeseries variable to WEEK and be able to reference timeseries throughout the code so that I only need to change the one section of code right at the top:
TOP OF CODE
timeseries <- "WEEK"
EXAMPLE MID CODE
summary_transposed_no_time = summary_transposed_no_missing
summary_transposed_no_time$timeseries <- NULL
I have found that this approach does work for sqldf steps; the code below works just fine. Ideally I want to use this approach across both R logic and SQL logic, as the program is very lengthy and a lot of it is written in SQL, which I would love to avoid rewriting:
dataset <- "client_a_data"
response <- "SALE"
timeseries <- "WEEK"
region <- "POSTAL_DIST"
summary <- sqldf(paste("SELECT",timeseries,
",",region,
",sum(",response,") AS", response,
"FROM", "dataset",
"GROUP BY",timeseries,"," ,region,
"ORDER BY",timeseries,"," ,region
)
)
I think I see what you're trying to achieve, but let me know if I'm off track...
One way I can see to do this would be to look up the appropriate column early in your script, and use the returned index from then on to refer to the column.
df <- data.frame( data = rnorm( 20, 1, 1 ), day = seq_len( 20 ) )
df$week <- ((df$day - 1) %/% 7) + 1
Now we can specify your timeseries variable as any of the columns in the frame:
timeseries <- "week"
Then, somewhere in our script, have something like this to extract a reference for the column:
timeColumn <- match( timeseries, names( df ) )
Which now allows you to refer to that column as many times as you like in your script:
df[, timeColumn]
Any time you change that "week" value to, say "day", the rest of your script will now change to refer to that instead.
Just a note: if you do go this route, either avoid moving columns around (which would make the stored index point at the wrong column) or rerun the match call each time you want to refer to the column (which lets you move columns around if you need to).
You can refer to any column by name directly. Variables response, timeseries and region are as defined in the question.
# generate some data
client_a_data <- data.frame(SALE=100:104, WEEK=1:5, POSTAL_DIST=60000:60004)
# read in data
dataset <- ... # whatever code you use to upload the client_a_data
# here:
dataset <- "client_a_data"
dataset <- get(dataset)
dataset
SALE WEEK POSTAL_DIST
1 100 1 60000
2 101 2 60001
3 102 3 60002
4 103 4 60003
5 104 5 60004
# refer to any column by its pre-defined name
dataset[, timeseries]
[1] 1 2 3 4 5
dataset[, c(response, region)]
SALE POSTAL_DIST
1 100 60000
2 101 60001
3 102 60002
4 103 60003
5 104 60004
So your specific line that would delete the WEEK column should read:
summary_transposed_no_time[, timeseries] <- NULL
Or you might wish to rename the relevant columns at the beginning of your code to whatever text appears throughout.
colnames(dataset)[match(c(timeseries, response, region), colnames(dataset))] <- c("timeseries", "response", "region")
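The reason the original line did nothing is that $ takes the text after it literally, so it looks for a column actually called "timeseries". A small sketch of the difference, using made-up data with the column names from the question (the object name d is an assumption for this example):

```r
timeseries <- "WEEK"
d <- data.frame(SALE = 100:104, WEEK = 1:5, POSTAL_DIST = 60000:60004)

# $ does not evaluate the variable: there is no column named "timeseries",
# so nothing is removed
d$timeseries <- NULL
cols_after_dollar <- ncol(d)

# [[ ]] evaluates the variable, so this removes the WEEK column
d[[timeseries]] <- NULL
names(d)
```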
I am new to Stack Overflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R script that allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I ran into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem. I have to get the values from the detector subsets into the result dataframe. Here it is important that each value finds its way to the right position in the dataframe. The issue is that there are not equally many values, since some samples lack some detectors.
I tried to do the following: iterate through the detector subsets, compare the row name (= sample name) with the sample, and if they are the same, write the value into the new dataframe. If they are not the same, it should write an NA.
for (i in 1:length(detectors)){
for (j in 1:length(get(detectors[i])$Sample)){
result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Ct.Mean[j], NA)
}
}
The trouble is that this stops iterating through the detector's Sample column and switches to the next detector. My understanding is that the samples being compared get out of sync, so that every following ifelse yields NA.
I tried to circumvent it by replacing the "no" argument of ifelse(test, yes, no) with j = j + 1 to get it back in sync, but this unfortunately didn't work.
I hope I have made my problem understandable!
Looking forward to any suggestions or comments (also on how to improve my code in general ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
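For comparison, the same long-to-wide reshaping can be sketched in base R with tapply(), which leaves NA in cells that have no data. This is not the method from the answer above, just an alternative under the assumption that each Sample/Detector pair occurs at most once (mean is used only to reduce each cell to a single number).

```r
# The example data from the question
DF <- data.frame(Sample   = c(1, 1, 2, 3, 3, 3),
                 Detector = c("A", "B", "A", "A", "B", "C"),
                 Value    = c(1, 2, 3, 2, 3, 1))

# Rows are samples, columns are detectors; empty combinations become NA
wide <- with(DF, tapply(Value, list(Sample = Sample, Detector = Detector), mean))

# Convert the matrix to a data frame with samples as row names
wide_df <- as.data.frame(wide)
wide_df
```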
So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
# LOG_MESSAGE CURRENT_EVENT
# 1 FIRST_EVENT FIRST_EVENT
# 2 FIRST_EVENT
# 3 SECOND_EVENT SECOND_EVENT
# 4 SECOND_EVENT
# 5 SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of step 1 is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first element of each of these vectors,
4. so I use sapply() to run the subsetting function `[`() and extract the first element from each list component.
5. The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data frame dat.
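If adding a package is not an option, the "carry the last value forward" step can also be sketched in base R with cumsum(). This is an alternative to na.locf(), not what the answer above uses, and the variable names are made up for the example.

```r
log_message <- c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA)

# TRUE where an event was logged
seen <- !is.na(log_message)

# Running count of events seen so far: 1 1 2 2 2
idx <- cumsum(seen)

# Look up the most recent event for each row; the leading NA covers
# rows that come before the first logged event (idx == 0)
current_event <- c(NA, log_message[seen])[idx + 1]
current_event
```

Because everything is done with vector operations, this scales to hundreds of thousands of rows without an explicit loop.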
Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:
TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)
However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.
Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the Apply() function, but that's irrelevant. My understanding is that the above line should work on a row-by-row basis, but it isn't.
Here are the specifics for testing:
TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3),
Month=c(1,5,6,11,4,10,1,5,10),
Location=c(1,5,6,7,10,3,4,2,8))
This TestDF keeps track of where each of 3 employees was over the course of the year among several locations.
(You can think of "Location" as unique to each Employee; it is essentially a unique ID for that row.)
The function EmployeeLocationNumber takes a location and outputs a number indicating the order in which that employee visited that location. For example EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.
EmployeeLocationNumber <- function(Site){
CurrentEmployee <- subset(TestDF,Location==Site,select=Employee, drop = TRUE)[[1]]
LocationDate<- subset(TestDF,Location==Site,select=Month, drop = TRUE)[[1]]
LocationNumber <- length(subset(TestDF,Employee==CurrentEmployee & Month<=LocationDate,select=Month)[[1]])
return(LocationNumber)
}
I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.
So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:
1. Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
2. Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
3. I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER JOIN type query?
Using logical indexing, the condensed one-liner replacement for your function is:
EmployeeLocationNumber <- function(Site){
with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}
Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.
B) In what sense is Location 8 the "second location visited"?
C) If you want within-group ordering, then you need to pass your dataframe, split up by employee, to a function that calculates a result.
D) Conditional access of a data.frame typically involves logical indexing and/or the use of which().
If you just want the sequence of visits by employee, try this:
(Changed first argument to Month since that is what determines the sequence of locations. seq_along numbers the rows within each employee, so it assumes the rows are already in Month order.)
with(TestDF, ave(Month, Employee, FUN=seq_along))
[1] 1 2 3 4 1 2 1 2 3
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=seq_along))
If you wanted the second location for EE:3 it would be:
subset(TestDF, LocOrder==2 & Employee==3, select= Location)
# Location
# 8 2
The vectorized nature of R (aka row-by-row) works not by repeatedly calling the function with each next value of the arguments, but by passing the entire vector at once and operating on all of it at one time. But in EmployeeLocationNumber, you only return a single value, so that value gets repeated for the entire data set.
Also, your example for EmployeeLocationNumber does not match your description.
> EmployeeLocationNumber(8)
[1] 3
Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()
TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)
which gives
> TestDF
Employee Month Location ELN
1 1 1 1 1
2 1 5 5 2
3 1 6 6 3
4 1 11 7 4
5 2 4 10 1
6 2 10 3 2
7 3 1 4 1
8 3 5 2 2
9 3 10 8 3
As to your other questions, I would just write it as
TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)
The logic is: take the months, look at them in groups by employee, and give the rank order of the months within each group (where they fall in order).
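The difference between a function that returns one value and one that returns a value per element can be seen in a small sketch. The data frame and the two helper functions below are made up for illustration and are not part of the question's code.

```r
df <- data.frame(x = c(3, 1, 2))

# Returns a single value no matter how long the input is
first_only <- function(v) v[1] * 10

# Returns one value per input element
elementwise <- function(v) v * 10

# The single value is recycled down the whole column: 30 30 30
df$a <- first_only(df$x)

# One result per row: 30 10 20
df$b <- elementwise(df$x)

# Calling the scalar function once per element also gives 30 10 20
df$c <- sapply(df$x, first_only)
df
```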
Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:
EmployeeLocationNumber(TestDF$Location) # returns 1
TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
Assignment doesn't do any magic like that. It takes a value and puts it somewhere. In this case the value 1. If the value was a vector of the same length as the number of rows, it would work as you wanted.
I'll get back to you on that :)
Ditto.
Update: I finally worked out some code to do it, but by then @DWin had posted a much better solution :(
TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))
...I guess the ave function does pretty much what the code above does. But for the record:
First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
Update again: regarding question 2, "What is the best way to reference a value in a dataframe?":
This depends a bit on the specific problem, but if you have a value, say Employee=3, and want to find all rows in the data.frame that match it, then simply:
TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3