R: for-loop solution to deleting columns from multiple data frames

My question is probably quite simple, but I think my code could definitely be improved. Right now it's two for-loops; I'm sure there's a way to do what I need in a single loop, but for the life of me I can't see what it is.
Having searched Stack, I found this excellent answer from Ananda where he was able to extract and keep columns within a range using lapply and for-loop methods. The structure of my data gets in the way, however, as I want to be able to pick specific columns to delete. My data structure looks like this:
1 AAAT_1 1 GROUP **** 1 -13.70 0
2 AAAT_2 51 GROUP **** 1 -9.21 0
3 AAAT_3 101 GROUP **** 1 -7.60 0
4 AAAT_4 151 GROUP **** 1 -6.28 0
It's extracted from some docking software, and the only columns I want to keep are 2 (e.g. AAAT_1) and 7 (e.g. -13.70). The code I've used to do it is two for-loops:
for (i in 1:length(temp)) {
  assign(temp[i], get(temp[i])[2:7])
}
....to keep the data from columns 2-7, followed by:
for (i in 1:length(temp)) {
  assign(temp[i], get(temp[i])[-2:-5])
}
....to delete the rest of the columns I didn't need, where temp is just a character vector holding the names of the data frames the loops are acting on.
So, as you can see, it's just two loops doing similar actions. Surely there's a way to pick specific columns to keep/delete and do it all in one loop/lapply statement? Trying things like [2,7] in the get statement doesn't work; it appears to keep only column 7 and turns each data frame into 'Values' instead. I'm not sure what's going on there, so any insight would be wonderful, but either way, if anyone can turn this two-loop solution into one it would be really appreciated. Definitely feel like I'm missing something really simple/obvious.
Cheers.
EDIT: Have taken into account the vectorised solutions from below to do the following instead. The names of raw imported data start with stuff like F0001, F0002, etc. hence the pattern to make the initial list.
lst <- mget(ls(pattern='^F\\d+'))
lst <- lapply(lst, "[", TRUE, c("V2","V7") )
lst <- lapply(seq_along(lst),
              function(i, x) { assign(temp[i], x[[i]], envir = .GlobalEnv) },
              x = lst)
I know loops get a bad rap in R; they were the natural solution to me as a C++ programmer, but this was far quicker anyway. The only downside of the other example was that the assign command pasted a letter plus a sequence number 1,2,3,...,n onto each created table, and since the raw imported data files weren't entirely in numerical order (i.e. 1,2,3,5,6,10,...etc.), that didn't preserve their order. So I had to use a list of the file names (our old friend temp) to name them correctly. A minor thing, and the code isn't much shorter than two loops, but it's most certainly faster.
So, in short, the above three lines add all the imported raw data to a list, keep only the columns I need then split the list up into separate dataframes whilst preserving the correct names. Cheers for the help!
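One further compaction, for anyone finding this later: the whole thing can be done in a single pass with list2env() (a sketch, assuming the same F0001-style names sitting in the global environment):
# keep only V2 and V7 in every F-prefixed data frame, writing the
# results back under their original names
list2env(lapply(mget(ls(pattern = '^F\\d+')), "[", c("V2", "V7")),
         envir = .GlobalEnv)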

If you have a data frame, you index rows and columns with
data.frame[row, column]
So, data.frame[2, 7] will give you the value of the 2nd row in the 7th column. I guess you were looking for
temp <- temp[, c(2,7)]
or, if temp is a list of data frames
temp <- lapply(temp, function(x) x[, c(2,7)])
So, if you want to use a vector of numbers as column or row indices, create this vector with c(...). If I understand your example right, you don't need any loop at all if you use lapply.
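A quick self-contained illustration with the built-in mtcars data:
mtcars[, c(2, 7)]   # keep the 2nd and 7th columns (cyl and qsec)
mtcars[, -c(2, 7)]  # or drop those two columns instead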

A for loop? Maybe I'm missing something, but why not use the solution proposed by @Daniel or a dplyr approach like this?
data
V1 V2 V3 V4 V5 V6 V7 V8
1 1 AAAT_1 1 GROUP **** 1 -13.70 0
2 2 AAAT_2 51 GROUP **** 1 -9.21 0
3 3 AAAT_3 101 GROUP **** 1 -7.60 0
4 4 AAAT_4 151 GROUP **** 1 -6.28 0
and here is the code:
library(dplyr)
data <- select(data, V2, V7)
data
V2 V7
1 AAAT_1 -13.70
2 AAAT_2 -9.21
3 AAAT_3 -7.60
4 AAAT_4 -6.28
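select() also understands ranges and negative selections, so the equivalent "drop" formulation of the same step would be:
data <- select(data, -V1, -(V3:V6), -V8)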

Related

R: Dropping variables using number of observations

I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my dataframe where n < 3 (total observations for that variable is less than 3). Since R can count observations for each variable using describe, can't I use that number to subset the data instead of having to type in each variable name each time I pull in a new version (each version has different variables that will have low n's and there are over 40 variables). Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
1 3 NA 4 1 NA
2 NA NA 2 1 NA
3 4 NA 6 2 3
4 1 NA 1 1 NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1 respectively. However, instead of telling R to drop them by variable name, it would be much more convenient if it was possible to tell R to drop any variable where n < 3 (or whatever number I choose) as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out) but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save, so I also included an image of it. Sorry. This is the first time I've ever used Stack Overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
DF being your dataframe, this one-liner was suggested:
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
That line did not work for me, however.
This function did the trick:
valid <- function(x) {sum(!is.na(x))}
N <- apply(UIRCorrelation,2,valid)
UIRCorrelation2 <- UIRCorrelation[N > 3]
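The same threshold also fits in one line with colSums(), since !is.na() yields a logical matrix whose column sums are exactly the non-NA counts (a sketch against the same data frame):
UIRCorrelation2 <- UIRCorrelation[colSums(!is.na(UIRCorrelation)) > 3]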

Get rows with common value into lists

I'm trying to gather the rows according to the values in the "Type of region" column into lists, and put these lists into another data structure (vector or list).
The data looks like this (~700 000 lines):
chr CS CE CloneName score strand # locs per clone # capReg alignments Type of region
chr1 10027684 10028042 clone_11546 1 + 1 1 chr1_10027880_10028380_DNaseI
chr1 10027799 10028157 clone_11547 1 + 1 1 chr1_10027880_10028380_DNaseI
chr1 10027823 10028181 clone_11548 1 - 1 1 chr1_10027880_10028380_DNaseI
chr1 10027841 10028199 clone_11549 1 + 1 1 chr1_10027880_10028380_DNaseI
Here's what I tried to do:
typeReg <- dat[!duplicated(dat$`Type of region`), ]
res <- list()
for(i in 1:nrow(typeReg)){
  res[[i]] <- dat[dat$`Type of region` == typeReg[i, ]$`Type of region`, ]
}
The for loop took too much time, so I tried using an apply:
res <- apply(typeReg, 1, function(x){
  tmp <- dat[dat$`Type of region` == x[9], ]
})
But it is also long (there are 300 000 unique values in the Type of region column).
Do you have a solution to my problem or is it normal that it's taking this long?
You can use split():
type <- as.factor(dat$`Type of region`)
split(dat, type)
But, as stated in the comments, using dplyr::group_by() may be a better option depending on what you want to do later.
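Depending on that later work, a minimal dplyr sketch of per-group aggregation (the n_clones summary name is just illustrative) would be:
library(dplyr)
dat %>%
  group_by(`Type of region`) %>%
  summarise(n_clones = n())  # e.g. count the clones per region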
OK, so split() works, but the subsetting doesn't drop unused levels of the factor I have in my df. Basically, every list element the split function created carried all 300 000 levels from the original df, hence the huge size of the result. The possible solutions are to use droplevels() on every list element created (not optimal if one list is too big to store in RAM), use a for loop (really slow), or remove the columns that cause the problem, which is what I did.
res=split(dat[,c(-4,-9)], dat$`Type of region`, drop=TRUE)
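For data that comfortably fits in memory, the droplevels() route mentioned above would look like:
res <- split(dat, dat$`Type of region`, drop = TRUE)
res <- lapply(res, droplevels)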

Find the number of occurrences of a set of two columns of a data frame in other data frames in r

I have 103 data frames, each with 7 variables and more than 1000 rows. I want to find the number of occurrences of a pair of two columns of one data frame in the other 102 data frames. In other words, how many times c(V1,V3) together (= two columns of a data frame together) can be seen in the other 102 data frames.
I've already written a code, but it is very slow!
I put all 103 data frames in a list and convert it to a data frame. Then I make a for loop to read each data frame one by one, and in each loop I have another for loop to search for each row of the data frame in that list!
The main part of the codes is as follows:
for(i in file){
  input <- read.table(i)
  for(j in 1:1000){
    df1 <- data.table(input[j, c(1, 3)])
    count <- merge(df1, dt, c("V1", "V3"))  # dt is a data frame that includes all 103 data frames
    df1["count"] <- nrow(count)
  }
}
In this way, I can count how many times the pair of V1 and V3 from one data frame appears in the other data frames. But obtaining the full results would need more than 50 days!
I wonder if anyone can help me with a faster way to obtain my desired results.
Example of the data frames (just 5 variables are considered here):
V1 V2 V3 V4 V5
1 Q0 abc 34 3
1 Q0 abd 31 9
1 Q0 bac 32 3
1 Q0 cba 56 0
2 Q0 zxc 37 3
2 Q0 fgc 30 3
2 Q0 ghc 36 3
In fact, I want to find out how many times each value of V3 appears in the other data frames, but because V3 and V1 are dependent, I must consider V1 in my search as well. So I have to see how many times c(V1,V3) appears in the other data frames, for example (1, abc) together, or (1, abd).
dt has the same structure as the data frames but it includes all data from all data frames that I have!
I will attempt an answer, but quite frankly I am not sure I have understood your problem. You also don't give enough data for us to work on, so it is difficult to find a solution. However, here it goes. I have commented out the lines which might be problematic and used some of my own. I will be glad to help further if this gets you closer.
V <- vector("list", length(file))
cnt <- 1
for(i in file){
  #input <- read.table(i)
  # Use fread to read the file. It is very fast.
  dt <- fread(i)[, c(1, 3), with = FALSE]
  # Create a dummy column which we will sum eventually
  dt[, VAL := 1]
  #dt <- merge(dt, df1, by = c('V1', 'V3'), all.x = TRUE)
  # Add to the list-vector to create the big data.table in the end
  V[[cnt]] <- dt
  cnt <- cnt + 1
  # You don't need a for-loop to merge line by line:
  #for(j in 1:1000){
  #  df1 <- data.table(input[j, c(1, 3)])
  #  count <- merge(df1, dt, c("V1", "V3"))  # dt includes all 103 data frames
  #  df1["count"] <- nrow(count)
  #}
}
# Create one big data.table
V <- rbindlist(V)
# Aggregate on V1 and V3 and see how many lines there are
V[, lapply(.SD, sum, na.rm = TRUE), by = c('V1', 'V3')]
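Equivalently, data.table's built-in .N counter gives the pair counts directly, without the dummy VAL column (a sketch against the combined table V above):
V[, .N, by = c('V1', 'V3')]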
I hope this helps. Otherwise, if you could upload a sample file, that would make things easier. Thanks.

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain, but here's an example (reconstructed along the lines of the data used in the answers below; the original was an image):
ID sample_value log_message
1            34 FIRST_EVENT
2            56
3            78 SECOND_EVENT
4            98
5           234
Now, I'd like to take that and turn it into this:
ID sample_value log_message  current_event
1            34 FIRST_EVENT  FIRST_EVENT
2            56              FIRST_EVENT
3            78 SECOND_EVENT SECOND_EVENT
4            98              SECOND_EVENT
5           234              SECOND_EVENT
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE = c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),  # list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
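For completeness, a minimal sketch of the same fill using tidyr::fill() in place of na.locf() (assuming recent tidyr and dplyr are installed):
library(dplyr)
library(tidyr)
dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat %>%
  fill(log_message) %>%                         # carry the last non-NA value down
  mutate(Current_Event = sub("_.*$", "", log_message))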

Merging databases in R on multiple conditions with missing values (NAs) spread throughout

I am trying to build a database in R from multiple csvs. There are NAs spread throughout each csv, and I want to build a master list that summarizes all of the csvs in a single database. Here is some quick code that illustrates my problem (most csvs actually have 1000s of entries, and I would like to automate this process):
d1=data.frame(common=letters[1:5],species=paste(LETTERS[1:5],letters[1:5],sep='.'))
d1$species[1]=NA
d1$common[2]=NA
d2=data.frame(common=letters[1:5],id=1:5)
d2$id[3]=NA
d3=data.frame(species=paste(LETTERS[1:5],letters[1:5],sep='.'),id=1:5)
I have been going around in circles (writing loops), trying to use merge and reshape(melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic but I can't figure out a good way to do it. Thanks in advance.
To be clear, I am aiming for a final database like this:
common species id
1 a A.a 1
2 b B.b 2
3 c C.c 3
4 d D.d 4
5 e E.e 5
I recently had a similar situation. The code below goes through all the variables and returns as much information as possible to add back into the dataset. Once all the data is there, running it one last time on the first variable gives you the result.
#combine all into one dataframe
require(gtools)
d <- smartbind(d1, d2, d3)
#function to get the first non-NA result
getfirstnonna <- function(x){
  ret <- head(x[which(!is.na(x))], 1)
  # head() returns a zero-length vector (not NULL) when x is all NA
  if (length(ret) == 0) NA else ret
}
#function to get max info based on one variable
runiteration <- function(dataset, variable){
  require(plyr)
  e <- ddply(.data = dataset, .variables = variable,
             .fun = function(x){ apply(X = x, MARGIN = 2, FUN = getfirstnonna) })
  #return the above without the NA "factor"
  e[which(!is.na(e[, variable])), ]
}
#run through all variables
for(i in seq_along(names(d))){
  d <- rbind(d, runiteration(d, names(d)[i]))
}
#repeat first variable since all possible info should be available in dataset
d <- runiteration(d, names(d)[1])
If id, species, etc. differ between the separate datasets, then this will return whichever non-NA value is on top. In that case, changing the row order in d or changing the variable order could affect the result. Changing the getfirstnonna function will alter this behaviour (tail would pick the last value; it could even collect all the possibilities). You could also order the dataset from the most complete records to the least.
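For this particular three-table example there is also a shorter route: full-merge on the shared keys, coalesce the duplicated id columns, then keep the first non-NA value per id. A sketch, not a general solution when keys genuinely conflict:
m <- merge(merge(d1, d2, by = "common", all = TRUE),
           d3, by = "species", all = TRUE)
m$id <- ifelse(is.na(m$id.x), m$id.y, m$id.x)  # coalesce the two id columns
aggregate(m[c("common", "species")], by = list(id = m$id),
          FUN = function(x) x[!is.na(x)][1])
This reproduces the target table above, with id as the first column.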
