Get rows with common value into lists - r

I'm trying to gather the rows that share a value in the "Type of region" column into lists, and to put those lists into another data structure (a vector or a list).
The data looks like this (~700 000 lines):
chr CS CE CloneName score strand # locs per clone # capReg alignments Type of region
chr1 10027684 10028042 clone_11546 1 + 1 1 chr1_10027880_10028380_DNaseI
chr1 10027799 10028157 clone_11547 1 + 1 1 chr1_10027880_10028380_DNaseI
chr1 10027823 10028181 clone_11548 1 - 1 1 chr1_10027880_10028380_DNaseI
chr1 10027841 10028199 clone_11549 1 + 1 1 chr1_10027880_10028380_DNaseI
Here's what I tried:
typeReg <- dat[!duplicated(dat$`Type of region`), ]
for (i in 1:nrow(typeReg)) {
  res[[i]] <- dat[dat$`Type of region` == typeReg[i, ]$`Type of region`, ]
}
The for loop took too much time, so I tried using apply() instead:
res <- apply(typeReg, 1, function(x) {
  dat[dat$`Type of region` == x[9], ]
})
But that is also slow (there are 300 000 unique values in the Type of region column).
Do you have a solution to my problem or is it normal that it's taking this long?

You can use split():
type <- as.factor(dat$`Type of region`)
split(dat, type)
But, as stated in the comments, using dplyr::group_by() may be a better option depending on what you want to do later.
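For reference, a minimal sketch of the grouped approach (assuming dplyr is available and dat has the column shown above); group_map() yields one data frame per region, much like split():
library(dplyr)

dat %>%
  group_by(`Type of region`) %>%
  group_map(~ .x)  # list of per-region data frames (grouping column dropped)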

OK, so split() works, but the subsetting doesn't drop the unused levels of the factors in my data frame. So every list element that split() created carried along all 300 000 levels from the original data frame, hence the huge size of the result. The possible solutions are to call droplevels() on every list element created (not optimal if one element is too big to hold in RAM), to use a for loop (really slow), or to remove the columns that cause the problem, which is what I did.
res <- split(dat[, c(-4, -9)], dat$`Type of region`, drop = TRUE)
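For completeness, the droplevels() route mentioned above is a one-liner over the split result; a sketch (it still visits every element, so it may be slow for 300 000 groups):
# Drop unused factor levels in each piece after splitting
res <- lapply(split(dat, dat$`Type of region`, drop = TRUE), droplevels)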


Assign multiple columns via vector without recycling

I am importing measurement data as a data frame and want to include the experimental conditions, which are given in the filename, in the data. I want to add new columns to the data frame that represent the conditions and assign each column the value specified by the filename. Later, this will facilitate comparisons to other experimental conditions once I merge the edited data frames from each individual sample/file.
Here is an example of my pre-existing dataframe Measurements:
Measurements <- data.frame(
  X = 1:4,
  Length = c(130, 150, 170, 140)
)
Here are the example vectors of variables and values that would be derived from the filename:
FileVars.vec <- c("Condition", "Plant")
FileInfo.vec <- c("aKG", "1")
Here is one way that I have solved how to do what I want:
for (i in 1:length(FileVars.vec)) {
  Measurements[FileVars.vec[i]] <- FileInfo.vec[i]
}
Which gives the desired output:
  X Length Condition Plant
1 1    130       aKG     1
2 2    150       aKG     1
3 3    170       aKG     1
4 4    140       aKG     1
But my (limited) understanding of R is that it is a vectorized language that often avoids the need for using for-loops. I feel like this simpler code should work:
Measurements[FileVars.vec] <- FileInfo.vec
But instead of assigning one value for one entire column, it recycles the values within each column:
  X Length Condition Plant
1 1    130       aKG   aKG
2 2    150         1     1
3 3    170       aKG   aKG
4 4    140         1     1
Is there any way to do a similarly simple assignment, but without recycling, i.e. one value assigned to one full column only? I imagine there's a simple formatting fix, but I've searched for a solution for >6 hours and nowhere did I see an assignment like this. I have also thought of creating a separate data frame of just the experimental conditions and then merging it into the actual data frame, but that seems more roundabout to me, especially with more experimental conditions and observations than in these examples.
Also, if there is a more established pipeline/package for taking information from the filename and adding it to the data in a tidy fashion, that would be marvelous as well! The original filename would be something like:
"aKG_1.csv"
Thank you for helping an R noobie! May you receive good coding karma when debugging!
We can convert to a list and then assign, to avoid recycling the values column-wise. As it is a list, each element is treated as a unit and assigned to the respective column, recycling that element down the column.
Measurements[FileVars.vec] <- as.list(FileInfo.vec)
Output:
Measurements
# X Length Condition Plant
#1 1 130 aKG 1
#2 2 150 aKG 1
#3 3 170 aKG 1
#4 4 140 aKG 1
If we want to reset the column types afterwards, use type.convert():
Measurements <- type.convert(Measurements, as.is = TRUE)
Note that because FileInfo.vec was created as a vector, it has a single type, i.e. character. If we instead want multiple types, it can be a list:
Measurements[FileVars.vec] <- list("aKG", 1)
For the second part of the question, if we have a string
str1 <- "aKG_1.csv"
and want to create two columns from it, use either read.table() or strsplit():
Measurements[FileVars.vec] <- read.table(text = tools::file_path_sans_ext(str1),
                                         sep = "_", header = FALSE)

R: Dropping variables using number of observations

I have a large dataset and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my data frame where n < 3 (fewer than 3 total observations for that variable). Since R can count observations for each variable using describe(), can't I use that number to subset the data instead of having to type in each variable name every time I pull in a new version? (Each version has different variables with low n's, and there are over 40 variables.) Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
 1       3         NA         4               1     NA
 2      NA         NA         2               1     NA
 3       4         NA         6               2      3
 4       1         NA         1               1     NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1, respectively. However, instead of telling R to drop them by variable name, it would be much more convenient to tell R to drop any variable where n < 3 (or whatever number I choose), as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out), but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save, so I also included an image of it. Sorry, this is the first time I've ever used Stack Overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
One suggested one-liner, DF being your dataframe:
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
That line did not work for the asker.
This function did the trick:
valid <- function(x) sum(!is.na(x))
N <- apply(UIRCorrelation, 2, valid)
UIRCorrelation2 <- UIRCorrelation[N > 3]
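An equivalent base-R one-liner, since !is.na() on a data frame gives a logical matrix whose column sums are exactly the per-variable n; a sketch mirroring the N > 3 threshold above:
# Keep columns with more than 3 non-missing observations
UIRCorrelation2 <- UIRCorrelation[, colSums(!is.na(UIRCorrelation)) > 3, drop = FALSE]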

Make dataframe from a list of lists in R

I would like to make a data frame from a list of n elements. Each element contains 3 different lists inside, and I am only interested in 1 of those 3. The list I am interested in is a data.frame with 12 obs. of 12 variables.
My input tmp in my lapply function is a list of n elements, each with 5 observations.
2 of those observations are the Latitude and Longitude. This is what my lapply function looks like:
library(httr)  # for GET() and content()

DF_Google_Places <- lapply(tmp, function(tmp) {
  Latitude  <- tmp$Latitude
  Longitude <- tmp$Longitude
  LatLon <- paste(Latitude, Longitude, sep = ",")
  res <- GET(paste0("https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=",
                    LatLon, "&radius=200&types=food&key=", apiKey))  # apiKey: your own key (redacted here)
  jsonAnsw <- content(res, "text")
  myDataframe <- jsonlite::fromJSON(jsonAnsw)  # reuse the text we already extracted
})
My question is: how do I get this list of 12 obs. of 12 variables into a data frame from a list of n?
Could anyone help me out? Thanks!
I'm just posting my comment as an answer so I can show some output to illustrate the idea:
x <- list(a=list(b=1,c=2),d=list(b=3,c=4))
So x is a nested list structure, in this case with consistent naming / structure one level down.
> x
$a
$a$b
[1] 1
$a$c
[1] 2
$d
$d$b
[1] 3
$d$c
[1] 4
Now we'll use do.call to build the data.frame. We need to pass it a named list of arguments, so we'll use list(sapply(...)) to get that named list. We'll walk the outer level of the list by position and the inner level by name, since the names are consistent across sub-lists at the inner level. Note that the key idea is essentially to reverse the intuitive way of indexing: since we want to pull observations at the second level from across elements at the first level, the inner call to sapply traverses multiple values of level one for each name at level two.
y <- do.call(data.frame,
             list(sapply(names(x[[1]]),
                         function(t) sapply(1:length(x),
                                            function(j) x[[j]][[t]]))))
> y
b c
1 1 2
2 3 4
Try breaking the command apart to see what each step does. If there is any consistency in your sub-list structure, you should be able to adapt this approach to walk that structure in the right order and pull together the data you need.
On a large dataset, this would not be efficient, but for 12x12 it should be fine.
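When the sub-lists are this consistent, another common pattern is to turn each sub-list into a one-row data frame and stack them; a sketch on the same x:
# Each named sub-list becomes a one-row data frame; rbind stacks the rows
y2 <- do.call(rbind, lapply(x, as.data.frame))
y2
#   b c
# a 1 2
# d 3 4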

R: for-loop solution to deleting columns from multiple data frames

My question is probably quite simple, but I think my code could definitely be improved. Right now it's two for-loops, but I'm sure there's a way to do what I need in a single loop; for the life of me, I can't see what it is.
Having searched Stack, I found this excellent answer from Ananda where he was able to extract and keep columns within a range using lapply and for-loop methods. The structure of my data gets in the way, however, as I want to be able to pick specific columns to delete. My data structure looks like this:
1 AAAT_1 1 GROUP **** 1 -13.70 0
2 AAAT_2 51 GROUP **** 1 -9.21 0
3 AAAT_3 101 GROUP **** 1 -7.60 0
4 AAAT_4 151 GROUP **** 1 -6.28 0
It's extracted from some docking software, and the only columns I want to keep are 2 (e.g. AAAT_1) and 7 (e.g. -13.70). The code I've used to do it is two for-loops:
for (i in 1:length(temp)) {
  assign(temp[i], get(temp[i])[2:7])
}
....to keep the data from columns 2-7, followed by:
for (i in 1:length(temp)) {
  assign(temp[i], get(temp[i])[-2:-5])
}
....to delete the rest of the columns I didn't need, where temp is just a vector of the names of the data frames the loop is acting on.
So, as you can see, it's just two loops doing similar actions. Surely there's a way to pick specific columns to keep/delete and do it all in one loop/lapply statement? Trying things like [2,7] in the get statement doesn't work; it appears to keep only column 7 and turns each data frame into 'Values' instead. I'm not sure what's going on, so any insight there would be wonderful, but either way, if anyone can turn this two-loop solution into one it would be really appreciated. I definitely feel like I'm missing something really simple/obvious.
Cheers.
EDIT: I've taken the vectorised solutions from below into account and now do the following instead. The names of the raw imported data start with stuff like F0001, F0002, etc., hence the pattern used to make the initial list.
lst <- mget(ls(pattern = '^F\\d+'))
lst <- lapply(lst, "[", TRUE, c("V2", "V7"))
lst <- lapply(seq_along(lst),
              function(i, x) assign(temp[i], x[[i]], envir = .GlobalEnv),
              x = lst)
I know loops get a bad rap in R; they were a natural solution to me as a C++ programmer, but meh, this was far quicker. Initially, the only downside versus the other example was that the assign command numbered the created tables in sequence 1, 2, 3, ..., n, and since the list of raw imported data files wasn't entirely in numerical order (i.e. 1, 2, 3, 5, 6, 10, ... etc.), this didn't preserve that order. So I had to use a list of the files (our old friend temp) to name them correctly. A minor thing, and the code isn't much shorter than two loops, but it's most certainly faster.
So, in short, the above three lines add all the imported raw data to a list, keep only the columns I need, then split the list back into separate data frames while preserving the correct names. Cheers for the help!
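As an aside, writing a named list of data frames back into the global environment is also a one-liner with base R's list2env(), which keeps the names mget() captured; a sketch of the same three steps:
lst <- mget(ls(pattern = '^F\\d+'))           # gather the imported frames by name
lst <- lapply(lst, "[", TRUE, c("V2", "V7"))  # keep only the wanted columns
list2env(lst, envir = .GlobalEnv)             # write each back under its own name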
If you have a data frame, you index rows and columns with
data.frame[row, column]
So, data.frame[2, 7] will give you the value of the 2nd row in the 7th column. I guess you were looking for
temp <- temp[, c(2,7)]
or, if temp is a list of data frames
temp <- lapply(temp, function(x) x[, c(2,7)])
So, if you want to use a vector of numbers as column or row indices, create this vector with c(...). If I understand your example right, you don't need any loop at all if you use lapply.
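Applied to the original two-loop setup, this collapses everything into a single loop; a sketch assuming temp still holds the data-frame names:
# Keep only columns 2 and 7 in one pass
for (i in 1:length(temp)) {
  assign(temp[i], get(temp[i])[c(2, 7)])
}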
A for loop? Maybe I'm missing something, but why not use the solution proposed by @Daniel, or a dplyr approach like this?
data
V1 V2 V3 V4 V5 V6 V7 V8
1 1 AAAT_1 1 GROUP **** 1 -13.70 0
2 2 AAAT_2 51 GROUP **** 1 -9.21 0
3 3 AAAT_3 101 GROUP **** 1 -7.60 0
4 4 AAAT_4 151 GROUP **** 1 -6.28 0
and here is the code:
library(dplyr)
data <- select(data, V2, V7)
data
V2 V7
1 AAAT_1 -13.70
2 AAAT_2 -9.21
3 AAAT_3 -7.60
4 AAAT_4 -6.28
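As a side note, select() also accepts bare column positions, so the numeric indexing from the other answer carries over directly; a small sketch (assuming a reasonably recent dplyr):
# Positional selection, equivalent to select(data, V2, V7)
data <- select(data, 2, 7)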

finding "almost" duplicates indices in a data table and calculate the delta

I have a smallish (2k) data set that contains questionnaire answers filled out by students who were sampled twice a year. Not all the students that were present for the first wave were there for the second wave, and vice versa. For each student, a unique ID was created that consists of the school code, the class code, the student number, and the wave as a decimal point. For example, 100612.1 is a student from school 10, grade 6, number 12 on the names list, in the first wave. The idea behind the decimal point was to have a way to identify the same student again in the data set (the only value which differs by less than abs(1) from a given ID is the same student in the other wave). At least that was the idea.
I was thinking of a script that would do the following:
- find the rows whose unique ID is less than abs(1) from one another
- for those rows, generate a new row (in a new table) that consists of the student ID and the delta of the measured variables (i.e. value in wave 2 - value in wave 1).
I am new to R, but I have a tiny bit of background in other OOP languages. I thought about creating a for loop that runs from 1 to length(df) and just looks for each row's "brother". My gut feeling tells me that this is not the way things are done in R. Any ideas?
All I need is a quick way of sifting through the data looking for the second-wave row. I think the rest should be straightforward from there.
Thank you for helping.
PS: Since this is my first post here, I apologize beforehand for any wrongdoings in this post... :)
The question alludes to data.table, so here is a way to adapt @jed's answer using that package.
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
Example data as before; now, instead of data.frame and tapply, you can do this:
library(data.table)
surveyDT <- data.table(ids, answers)
surveyDT[, `:=` (child = substr(ids, 1, 6), wave = substr(ids, 8, 8))] # split ID's
# note multiple assign-by-reference := syntax above
setkey(surveyDT, child, wave) # order data
# calculate delta on keyed data, grouping by child
surveyDT[, delta := diff(answers), by = child]
unique(surveyDT[, delta, by = child]) # list results
child delta
1: 100612 -1
2: 100613 1
3: 110714 NA
4: 201802 NA
To remove rows with NA values for delta:
unique(surveyDT[, .SD[(!is.na(delta))], by = child])
child ids answers wave delta
1: 100612 100612.1 5 1 -1
2: 100613 100613.1 3 1 1
Use .SDcols to output only specific columns (in addition to the by columns), for example,
unique(surveyDT[, .SD[(!is.na(delta))], by = child, .SDcols = 'delta'])
child delta
1: 100612 -1
2: 100613 1
It took me some time to get acquainted with data.table syntax, but now I find it more intuitive, and it's fast for big data.
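A reshaping sketch with the same package, assuming data.table's dcast(): put the two waves side by side per child, then subtract. The backticked `1` and `2` column names come from the wave values, and children present in only one wave get an NA delta, as above.
# One row per child, one column per wave; delta is wave 2 minus wave 1
wide <- dcast(surveyDT, child ~ wave, value.var = "answers")
wide[, delta := `2` - `1`]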
There are two ways that come to mind. The easiest is to use the function floor(), which returns the integer part of a number. For example:
floor(100612.1)
#[1] 100612
floor(9.9)
#[1] 9
Alternatively, you could write a fairly simple regular expression to strip the decimal part instead. Then you can use unique() to find the rows that are or are not duplicated entries.
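For instance, a one-line sub() call strips the wave suffix (a sketch on character IDs):
# Drop the ".<wave>" decimal part to recover the student ID
sub("\\.\\d+$", "", c("100612.1", "100612.2"))
#[1] "100612" "100612"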
Let's make some fake data so we can see our problem easily:
ids <- c(100612.1,100612.2,100613.1,100613.2,110714.1,201802.2)
answers <- c(5,4,3,4,1,0)
survey <- data.frame(ids,answers)
Now let's split our ids into two different columns:
survey$child_id <- substr(survey$ids,1,6)
survey$wave_id <- substr(survey$ids,8,8)
Then we'll order by child and wave, and compute differences based on child:
survey <- survey[order(survey$child_id, survey$wave_id), ]
survey$delta <- unlist(tapply(survey$answers, survey$child_id, function(x) c(NA, diff(x))))
Output:
ids answers child_id wave_id delta
1 100612.1 5 100612 1 NA
2 100612.2 4 100612 2 -1
3 100613.1 3 100613 1 NA
4 100613.2 4 100613 2 1
5 110714.1 1 110714 1 NA
6 201802.2 0 201802 2 NA
