Extract multiple data.frames from one with selection criteria - r

Let this be my data set:
df <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000),
                 split = sample(c('SPLITMEHERE', 'OBS'), 1000, replace = TRUE, prob = c(0.04, 0.96)))
So, I have some variables (in my case, 15), and criteria by which I want to split the data.frame into multiple data.frames.
My criterion is the following: every other time 'SPLITMEHERE' appears, I want to take all the values (all the 'OBS' rows) below it and build a data.frame from just those observations. So, if there are 20 'SPLITMEHERE's in the starting data.frame, I want to end up with 10 data.frames.
I know it sounds confusing and not very sensible, but this is the result of extracting the raw numbers from an awfully dirty .txt file to obtain meaningful data. Basically, every 'SPLITMEHERE' denotes a new table in this .txt file, but each county is divided into two tables, so I want one table (data.frame) per county.
In the hope I will make it more clear, here is the example of exactly what I need. Let's say the first 20 observations are:
x1 x2 x3 split
1 0.307379064 0.400526799 0.2898194543 SPLITMEHERE
2 0.465236674 0.915204924 0.5168274657 OBS
3 0.063814420 0.110380201 0.9564822116 OBS
4 0.401881416 0.581895095 0.9443995396 OBS
5 0.495227871 0.054014926 0.9059893533 SPLITMEHERE
6 0.091463620 0.945452614 0.9677482590 OBS
7 0.876123151 0.702328031 0.9739113525 OBS
8 0.413120761 0.441159673 0.4725571219 OBS
9 0.117764512 0.390644966 0.3511555807 OBS
10 0.576699384 0.416279417 0.8961428872 OBS
11 0.854786077 0.164332814 0.1609375612 OBS
12 0.336853841 0.794020157 0.0647337821 SPLITMEHERE
13 0.122690541 0.700047133 0.9701538396 OBS
14 0.733926139 0.785366852 0.8938749305 OBS
15 0.520766503 0.616765349 0.5136788010 OBS
16 0.628549288 0.027319848 0.4509875809 OBS
17 0.944188977 0.913900539 0.3767973795 OBS
18 0.723421337 0.446724318 0.0925365961 OBS
19 0.758001243 0.530991725 0.3916394396 SPLITMEHERE
20 0.888036748 0.862066601 0.6501050976 OBS
What I would like to get is this:
data.frame1:
1 0.465236674 0.915204924 0.5168274657 OBS
2 0.063814420 0.110380201 0.9564822116 OBS
3 0.401881416 0.581895095 0.9443995396 OBS
4 0.091463620 0.945452614 0.9677482590 OBS
5 0.876123151 0.702328031 0.9739113525 OBS
6 0.413120761 0.441159673 0.4725571219 OBS
7 0.117764512 0.390644966 0.3511555807 OBS
8 0.576699384 0.416279417 0.8961428872 OBS
9 0.854786077 0.164332814 0.1609375612 OBS
And
data.frame2:
1 0.122690541 0.700047133 0.9701538396 OBS
2 0.733926139 0.785366852 0.8938749305 OBS
3 0.520766503 0.616765349 0.5136788010 OBS
4 0.628549288 0.027319848 0.4509875809 OBS
5 0.944188977 0.913900539 0.3767973795 OBS
6 0.723421337 0.446724318 0.0925365961 OBS
7 0.888036748 0.862066601 0.6501050976 OBS
Therefore, the split column only shows me where to split; the data in the rows where 'SPLITMEHERE' is written are meaningless. But this is no bother, as I can delete those rows later; the point is separating the data into multiple data.frames based on this criterion.
Obviously, the split() function alone, or filter() from dplyr, wouldn't suffice here. The real problem is that the lines which are supposed to separate the data.frames (i.e. every other 'SPLITMEHERE') do not appear at regular intervals, as my example above shows: sometimes there is a gap of 3 lines, other times it could be 10 or 15.
Is there any way to extract this efficiently in R?

The hardest part of the problem is creating the groups. Once we have the proper groupings, it's easy enough to use a split to get your result.
With that said, you can use a cumsum for the groups. Here I divide the cumsum by 2 and take the ceiling so that each pair of SPLITMEHERE's collapses into one group. I also use an ifelse to flag the rows containing SPLITMEHERE itself with group 0:
df$group <- ifelse(df$split != "SPLITMEHERE", ceiling(cumsum(df$split=="SPLITMEHERE")/2), 0)
res <- split(df, df$group)
The result is a list with a dataframe for each group. Group 0 holds the SPLITMEHERE rows, which you want to throw out.
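To make the mechanics concrete, here is a minimal, self-contained sketch of the same cumsum/ceiling trick on toy data (the column names here are made up for illustration):

```r
# Toy data mirroring the question: every other SPLITMEHERE starts a new table
df <- data.frame(
  x1 = 1:8,
  split = c("SPLITMEHERE", "OBS", "OBS", "SPLITMEHERE",
            "OBS", "OBS", "SPLITMEHERE", "OBS")
)

# Group = ceiling(running count of SPLITMEHERE / 2); 0 marks the separator rows
df$group <- ifelse(df$split != "SPLITMEHERE",
                   ceiling(cumsum(df$split == "SPLITMEHERE") / 2), 0)

res <- split(df, df$group)
res[["0"]] <- NULL   # drop the SPLITMEHERE rows themselves
```

After this, res[["1"]] holds the rows following the first pair of separators (x1 values 2, 3, 5, 6) and res[["2"]] the rows after the third separator.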

Related

Construct dataframe columns based on a function of other columns in R

I am trying to figure out a way to do this in R, ideally with something in the apply() family of functions (i.e. not with a for loop).
I want to use a function based on four other columns in my data frame and I want to save the results of that function in three new columns of the data frame.
For example if I have (with test data):
x <- c("var1","var2","var3","var4")
A_x <- c(5,4,3,2)
A_notx <- c(5,6,7,8)
B_x <- c(10,10,5,15)
B_notx <- c(10,10,15,5)
example <- data.frame(A_x,A_notx,B_x,B_notx)
rownames(example) <- x
A_x A_notx B_x B_notx
var1 5 5 10 10
var2 4 6 10 10
var3 3 7 5 15
var4 2 8 15 5
And I want to use oddsratio() from the epitools library on these counts, how could I save the odds ratio as well as the upper and lower bounds as 3 new columns? I would like example$odds, example$upper, and example$lower to exist in my dataframe.
I have messed around a bit with apply() and by() but can't seem to figure it out. With apply the row is converted from a data frame to a matrix, and setting column values from within the function is outside its scope. Perhaps this whole thing is better served by a list object than a data frame? In the end I want to have all the information on hand (counts, statistics, etc.) for a given variable name, with a variable in each column.
Perhaps this is what you're looking for?
library(epitools)

example <- cbind(example,
                 t(apply(example, 1, function(x) {
                   oddsratio(as.table(rbind(x[1:2], x[3:4])))$measure[2, ]
                 })))
example
A_x A_notx B_x B_notx estimate lower upper
var1 5 5 10 10 0.99999998 0.20537812 4.8690679
var2 4 6 10 10 0.68116864 0.13043731 3.2586139
var3 3 7 5 15 1.28836297 0.20019246 7.2563905
var4 2 8 15 5 0.09603445 0.01039156 0.5446693
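As a quick sanity check on the first row, the plain cross-product odds ratio can be computed by hand. (This is a sketch, not epitools' method: oddsratio() defaults to a median-unbiased estimate, which is why the table above shows ~0.99999998 rather than exactly 1.)

```r
# var1's 2x2 table: A_x/A_notx on row 1, B_x/B_notx on row 2
tab <- rbind(c(5, 5), c(10, 10))

# Cross-product (sample) odds ratio
or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])
or
# 1
```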

R code to iteratively and randomly delete entire rows from a data frame based on a column value, and saving as a new data frame each time

Please forgive me if this question has been asked before!
So I have a dataframe (df) of individuals sampled from various populations with each individual given a population name and a corresponding number assigned to that population as follows:
Individual Population Popnum
ALM16-014 AimeesMdw 1
ALM16-024 AimeesMdw 1
ALM16-026 AimeesMdw 1
ALM16-003 AMKRanch 2
ALM16-022 AMKRanch 2
ALM16-075 BearPawLake 3
ALM16-076 BearPawLake 3
ALM16-089 BearPawLake 3
There are a total of 12 named populations (they do not all have the same number of individuals) with Popnum 1-12 in this file. What I need to do is randomly delete one or more populations (preferably using the 'Popnum' column) from the dataframe and repeating this 100 times and then saving each result as a separate dataframe (ie. df1, df2, df3, etc). The end result is 100 dfs with each one having one population removed randomly. The next step is to repeat this 100 times removing two random populations, then 3 random populations, and so on.
Any help would be greatly appreciated!!
You can write a function which takes a dataframe and n, i.e. the number of Popnum values to remove, as input.
remove_n_Popnum <- function(data, n) {
  subset(data, !Popnum %in% sample(unique(Popnum), n))
}
To remove one Popnum you can do:
remove_n_Popnum(df, 1)
# Individual Population Popnum
#1 ALM16-014 AimeesMdw 1
#2 ALM16-024 AimeesMdw 1
#3 ALM16-026 AimeesMdw 1
#4 ALM16-003 AMKRanch 2
#5 ALM16-022 AMKRanch 2
To do this 100 times you can use replicate
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)
To pass different n in remove_n_Popnum function you can use lapply
nested_list_data <- lapply(seq_along(unique(df$Popnum)[-1]),
                           function(x) replicate(100, remove_n_Popnum(df, x), simplify = FALSE))
where seq_along generates a sequence whose length is 1 less than the number of unique populations (so up to all but one can be removed).
seq_along(unique(df$Popnum)[-1])
#[1] 1 2
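A small self-contained check of the helper's behaviour (toy data; the Popnum column name is taken from the question, the ID values are made up):

```r
# Same helper as above, applied to toy data with 3 populations
remove_n_Popnum <- function(data, n) {
  subset(data, !Popnum %in% sample(unique(Popnum), n))
}

df <- data.frame(Individual = paste0("ID", 1:8),
                 Popnum = rep(1:3, times = c(3, 2, 3)))
set.seed(42)
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)

# Every replicate should retain exactly 2 of the 3 populations
all(sapply(list_data, function(d) length(unique(d$Popnum)) == 2))
# TRUE
```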

Split data frame into ntiles based on a value equal to the sum of rows divided by the number of ntiles we want

I have a data frame with about 45k points and 3 columns: weight, persons and population. Population is weight*persons. I want to be able to split the data frame into ntiles (deciles, centiles, etc.) as needed. The data frame has to be split in a way that there is the same amount of population in each ntile.
This means the data frame needs to be split at value = sum(population)/ntile. So, for example, if ntile = 10, then sum(population)/10 = a. Next I need to add up the values in the population column until the sum reaches a, split at that point, and continue until I have run through all 45k points. A sample of the data is below.
weight persons population
1 3687.926 9 33191.337
2 3687.926 16 59006.8217
3 3687.926 7 25815.4847
4 4420.088 5 22100.447
5 4420.088 7 30940.6167
6 4420.088 6 26520.5287
7 3687.926 15 55318.8927
8 3687.926 9 33191.3357
9 3687.926 6 22127.5577
10 4452.829 8 35622.6367
11 4452.829 3 13358.4887
12 4452.829 4 17811.3187
I have been trying to use loops. I am stuck on splitting the data frame into the n splits needed. I am new to R, so any help is appreciated.
x <- df$population
break_point <- sum(x)/10
ntile_points <- 0
for (i in 1:length(x)) {
  while (ntile_points != break_point) {
    ntile_points <- ntile_points + x[i]
  }
}
I'm not sure that's what you want; note that your break point is not necessarily hit exactly by the running sum, so you should test against the interval between consecutive break points instead of testing for equality:
ntile <- 10
df$Cumsum <- cumsum(df$population)
s <- seq(0, sum(df$population), sum(df$population)/ntile)
subdfs <- list()
for (i in 2:length(s)) {
  subdfs <- c(subdfs, list(df[df$Cumsum <= s[i] & df$Cumsum > s[i - 1], ]))
}
Then subdfs is a list containing 10 data frames, split as you wanted. Get the first with subdfs[[1]], and so on. Maybe I did not understand what you want; tell me.
This way the first data frame contains all the initial rows whose cumulative population lies in the interval ]0, sum(population)/10]; the second contains the following rows, whose cumulative sum lies in ]sum(population)/10, 2*sum(population)/10]; and so on.
Is that what you wanted?
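The same binning can be sketched without an explicit loop using findInterval(), which assigns each row to the (lo, hi] break interval containing its cumulative sum. (This is an alternative sketch on toy data, not part of the answer above.)

```r
# Toy data: 5 rows whose population sums to 200
df <- data.frame(population = c(30, 20, 50, 40, 60))
ntile <- 2

cs <- cumsum(df$population)
breaks <- seq(0, sum(df$population), length.out = ntile + 1)

# left.open = TRUE makes the intervals (lo, hi], matching the answer's logic
df$tile <- findInterval(cs, breaks, left.open = TRUE)
subdfs <- split(df, df$tile)
```

Here each of the two resulting data frames carries exactly half the total population (100).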

How to group rows in r based on row number

I would like to make 2 groups based on their row numbers (the 1st group being rows 2 to 47, and the 2nd group being rows 48 to 92). The groups are my top and bottom performing samples and I would like to compare the groups' values in the 12 data columns (genes being tested). So, my ultimate goal is to divide the samples into their appropriate groups, and run statistical analyses the group's values for each of the genes tested. Here is a small section of my table:
Sample icaA icaB icaC icaD
ST1 12 13 15 18
ST2 11 9 8 16
ST3 15 18 18 15
ST4 13 16 17 20
I don't know if I can use cbind to combine the groups. I think I've also seen others flip the rows and columns; I can do that if needed. I'm just a beginner with the software, so any suggestions would be great!
To get the first group:
df1 <- df[2:47, ]
To get the second group:
df2 <- df[48:92, ]
Right?
Then you can run stats on each column, for instance, like this:
apply(df1[-1], 2, mean)
...To get the mean for each column, in the first group.
Then for the mean of each column in the second group:
apply(df2[-1], 2, mean)
Then, to bind the groups back into one dataframe (or matrix), I recommend:
rbind(df1, df2)
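Since the goal is to compare the groups statistically, one way to go further is a per-column Welch t-test between the two groups. A sketch on toy data (gene names taken from the question's table; the values are made up):

```r
# Toy data in the question's shape: Sample column plus numeric gene columns
df <- data.frame(Sample = paste0("ST", 1:6),
                 icaA = c(12, 11, 15, 13, 9, 14),
                 icaB = c(13, 9, 18, 16, 10, 12))

df1 <- df[1:3, ]   # first group (rows 2:47 in the real data)
df2 <- df[4:6, ]   # second group (rows 48:92)

# One two-sample t-test per gene column, dropping the Sample column
pvals <- sapply(names(df)[-1], function(g) t.test(df1[[g]], df2[[g]])$p.value)
```

pvals is then a named vector with one p-value per gene, which you can inspect or adjust for multiple testing.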

How to calculate differences in column values based on value markers (ex.1 or 2) in a different column of the same .csv file in R?

I have a .csv file with several columns, but I am only interested in two of them (TIME and USER). The USER column consists of the value markers 1 or 2 in chunks, and the TIME column consists of a value in seconds. I want to calculate the difference between the TIME value of the first 2 in a chunk of the USER column and the first 1 in a chunk of the USER column, and I want to accomplish this in R. Ideally another column would be added to my data file with these differences.
So far I have only imported the .csv into R.
Latency <- read.csv("/Users/alinazjoo/Documents/Latency_allgaze.csv")
I'm going to guess your data looks like this
# sample data
set.seed(15)
rr<-sample(1:4, 10, replace=T)
dd <- data.frame(
  user = rep(1:5, each = 10),
  marker = rep(rep(1:2, 10), c(rbind(rr, 5 - rr))),
  time = 1:50
)
Then you can calculate the difference using the base functions aggregate and transform. Observe:
namin <- function(...) min(..., na.rm = TRUE)
dx <- transform(
  aggregate(cbind(m2 = ifelse(marker == 2, time, NA),
                  m1 = ifelse(marker == 1, time, NA)) ~ user,
            dd, namin, na.action = na.pass),
  diff = m2 - m1)
dx
# user m2 m1 diff
# 1 1 4 1 3
# 2 2 15 11 4
# 3 3 23 21 2
# 4 4 35 31 4
# 5 5 44 41 3
We use aggregate to find the minimal time for each of the two kinds of markers, then we use transform to calculate the difference between them.
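The same "first 2 minus first 1" logic can also be sketched per user with base by(), which may be easier to read (toy data, assuming markers ordered within each user):

```r
# Toy data: two users, marker 1 rows followed by marker 2 rows, time increasing
dd <- data.frame(user = rep(1:2, each = 4),
                 marker = c(1, 1, 2, 2, 1, 2, 2, 2),
                 time = 1:8)

# For each user: earliest marker-2 time minus earliest marker-1 time
diffs <- by(dd, dd$user, function(d)
  min(d$time[d$marker == 2]) - min(d$time[d$marker == 1]))
as.numeric(diffs)
# 2 1
```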
