How to do a complex search in a dataframe in R? - r

I have a data frame that has 14 columns. Two of those 14 columns are "Region" and "Population Density." Let's say that I want to find all instances where Region is 4 and print out the value of the population density for each of those instances.
# here I add a new column to the data frame called "PopDens";
# this new column takes the total population and divides it by
# the land area
cdi.df$PopDens <- cdi.df$TotalPop / cdi.df$LandArea
dim(cdi.df) #here I verify the columns are now 14 and not 13
head(cdi.df) #here I verify that the name and calculations are correct
head(cdi.df$PopDens, 3) #here I return only the first three values
Is there a way to return only the values of PopDens where Region equals 4?

Just like this:
cdi.df[cdi.df$Region==4,] #to see the data.frame with only region 4
cdi.df$PopDens[cdi.df$Region==4] #to see only the population densities
Next time, please provide a reproducible example, as explained here.
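As a small self-contained illustration of the same idea, here is a sketch on a made-up data frame (the values are invented; only the column names Region, TotalPop, LandArea and PopDens come from the question). It also shows one thing worth knowing: if Region contains NA, plain logical indexing returns NA entries, while which() or subset() drops them.

```r
# Toy stand-in for cdi.df; the numbers are invented for illustration
cdi.df <- data.frame(
  Region   = c(1, 4, 2, 4, NA),
  TotalPop = c(1000, 5000, 300, 800, 50),
  LandArea = c(10, 25, 3, 8, 5)
)
cdi.df$PopDens <- cdi.df$TotalPop / cdi.df$LandArea

# Logical indexing keeps an NA entry for the row where Region is NA
dens_with_na <- cdi.df$PopDens[cdi.df$Region == 4]

# which() keeps only the positions where the test is TRUE
dens4 <- cdi.df$PopDens[which(cdi.df$Region == 4)]

# subset() also drops NA rows and can select columns at the same time
sub4 <- subset(cdi.df, Region == 4, select = c(Region, PopDens))
print(sub4)
```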

Related

How to split a heat map or a draw a boundary with specific range?

I have an expression matrix containing three groups. I need to split the heatmap, or draw a boundary on it, at specific column ranges.
Total number of columns: 151, where the 1st column holds the gene IDs
Group1: 2:40
Group2: 41:80
Group3: 81:151
I searched for splitting the heatmap and I got some hits like this.
But they are based on specific clusters.
I need to give my ranges as (2:40, 41:80, 81:151) for splitting the heatmap or drawing a boundary on it
Something like this
library(pheatmap)
mat = cbind(genes=1:100,
matrix(rnorm(150*100,mean = rep(1:3,c(39*100,40*100,71*100))),ncol=150))
colnames(mat)[2:ncol(mat)] = paste0("col",1:150)
You need to know how many columns are in each group. From what you provided, I counted this:
Group1: 39 Group2: 40 Group3: 71
So you need to make a data.frame whose row names match the column names of your matrix, and which says which column belongs to Group1, Group2, etc.
DF = data.frame(Groups=rep(c("Group1","Group2","Group3"),c(39,40,71)))
rownames(DF) = colnames(mat)[2:ncol(mat)]
Then we plot. mat[,-1] means excluding the first column. You need to specify where to insert the gaps; for your example that is after columns 39 and 79, because we excluded the first column:
pheatmap(mat[,-1],cluster_cols=FALSE,
annotation_col=DF,gaps_col = cumsum(c(39,40)))
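If you would rather not count by hand, the group sizes and gap positions can be derived from the column ranges given in the question. This is only a sketch; the names ranges, sizes, gaps and groups are made up for illustration.

```r
# Column ranges from the question (column 1 holds the gene IDs)
ranges <- list(Group1 = 2:40, Group2 = 41:80, Group3 = 81:151)

# Number of columns in each group: 39, 40, 71
sizes <- lengths(ranges)

# A gap goes after the last column of every group except the final one.
# Positions are relative to mat[,-1], i.e. after dropping the ID column.
gaps <- cumsum(sizes)[-length(sizes)]

# Group label for every data column, usable for the annotation data.frame
groups <- rep(names(ranges), times = sizes)

print(gaps)
print(table(groups))
```

gaps can then be passed as gaps_col, and groups used to build the Groups column of DF.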

Split data frame into ntiles based on a value equal to the sum of rows divided by the number of ntiles we want

I have a data frame with about 45k points and 3 columns: weight, persons and population. Population is weight*persons. I want to be able to split the data frame into ntiles (deciles, centiles, etc.) as needed. The data frame has to be split in such a way that each ntile holds the same total population.
This means the data frame needs to be split at value = sum(population)/ntile. So, for example, if ntile = 10, then sum(population)/10 = a. Next I need to add up the row values in the population column until the sum reaches a, split at that point, and continue like this until I have run through all 45k points. A sample of the data is below.
     weight persons  population
1  3687.926       9  33191.337
2  3687.926      16  59006.8217
3  3687.926       7  25815.4847
4  4420.088       5  22100.447
5  4420.088       7  30940.6167
6  4420.088       6  26520.5287
7  3687.926      15  55318.8927
8  3687.926       9  33191.3357
9  3687.926       6  22127.5577
10 4452.829       8  35622.6367
11 4452.829       3  13358.4887
12 4452.829       4  17811.3187
I have been trying to use loops. I am stuck on splitting the data frame into the n splits needed. I am new to R, so any help is appreciated.
x= df$population
break_point = sum(x)/10
ntile_points = 0
for(i in 1:length(x))
{
while(ntile_points != break_point)
{
ntile_points = ntile_points+x[i]
}
}
I'm not sure that's what you want. Note that your break point is not necessarily an integer, so you should work with the intervals between consecutive break points:
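ntile <- 10
# running total of the population column
df$Cumsum <- cumsum(df$population)
# ntile + 1 equally spaced break points from 0 up to the total population
s <- seq(0, df$Cumsum[nrow(df)], length.out = ntile + 1)
subdfs <- list()
for (i in 2:length(s)) {
  # rows whose running total falls in the interval ]s[i-1], s[i]]
  subdfs[[i - 1]] <- df[df$Cumsum > s[i - 1] & df$Cumsum <= s[i], ]
}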
Then subdfs is a list which contains 10 data frames split as you wanted. Access the first data frame with subdfs[[1]] and so on. Maybe I did not understand what you want; tell me if so.
In this way the first data frame contains the first rows, for as long as the cumulative sum of the population stays in the interval ]0,sum(population)/10]; the second contains the following rows, where the cumulative sum of the population is in the interval ]sum(population)/10,2*sum(population)/10], and so on.
Is that what you wanted?
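The same interval logic can also be sketched without a loop, using cut() on the cumulative sum and split(). The data below are the weight and persons values from the sample in the question, with population recomputed as weight*persons; ntile = 4 and the names cum, breaks and parts are arbitrary choices for the illustration.

```r
# Sample rows from the question; population recomputed as weight * persons
df <- data.frame(
  weight  = c(3687.926, 3687.926, 3687.926, 4420.088, 4420.088, 4420.088,
              3687.926, 3687.926, 3687.926, 4452.829, 4452.829, 4452.829),
  persons = c(9, 16, 7, 5, 7, 6, 15, 9, 6, 8, 3, 4)
)
df$population <- df$weight * df$persons

ntile <- 4  # arbitrary for this small sample

# Running total and equally spaced break points from 0 to the grand total
cum <- cumsum(df$population)
breaks <- seq(0, cum[length(cum)], length.out = ntile + 1)

# cut() uses right-closed intervals by default, matching ]s[i-1], s[i]]
df$ntile <- cut(cum, breaks = breaks, labels = FALSE, include.lowest = TRUE)

# One data frame per ntile (an ntile that receives no rows is simply absent)
parts <- split(df, df$ntile)
print(sapply(parts, nrow))
```

Because a row cannot be divided, the population totals per ntile are only approximately equal; a row that crosses a break point lands entirely in the later ntile.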

Multiplying the number of counts in one column by a value in another to increase count

Seems like quite an easy problem to solve, but I can't seem to get my head around it in R.
I have a dataset with the following columns:
'Biomass' where each row is a value of biomass for a particular species
'Count' where each row is the number of individual animals of that species counted
I need to create a histogram of biomasses, but if I use hist(DF$Biomass) I will get a histogram of the biomasses of the animals where each value is one animal.
I need to include the count, so that I have (for example) the weight frequencies of elephant x 2, giraffe x 56 etc..
you're not making my life easy :)
Is this what you want ?
DF <- data.frame(Biomass=c(200,200,1500),Count = c(36,20,2))
DF2 <- aggregate(Count ~ Biomass,DF,sum) # sum different occurrences for each Biomass value
barplot(DF2$Count,names.arg =DF2$Biomass) # presents them with a barplot, which is more appropriate here than a histogram in the R sense.
If I understood you right that is what you need :)
biomass<-c(1,5,7,6,3)
count<-c(1,2,1,3,4)
new<-NULL
for (i in 1:length(biomass))
{
new<-c(new, rep(biomass[i], count[i]))
}
new
hist(new)
So finally just type:
new<-NULL
for (i in 1:length(DF$Biomass))
{
new<-c(new, rep(DF$Biomass[i], DF$Count[i]))
}
hist(new)
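The loop above can also be written as a single vectorized call, since rep() accepts a vector of repeat counts through its times argument. A minimal sketch, reusing the example DF from the earlier answer (the name expanded is made up):

```r
DF <- data.frame(Biomass = c(200, 200, 1500), Count = c(36, 20, 2))

# Repeat each biomass value as many times as its count
expanded <- rep(DF$Biomass, times = DF$Count)

# One entry per individual animal
print(length(expanded))
print(table(expanded))
```

hist(expanded) then draws the histogram with each animal counted once.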

Compute new column based on values in current and following rows with dplyr in R

I have a big dataset (10+ million rows x 30 variables) and I am trying to compute some new variables based on complicated interactions of the current ones. For clarity I am including only the important variables in the question. I have the following code in R, but I am interested in other views and opinions. I am using the dplyr package to compute new columns based on current/following row values of 3 other columns (more explanation below the code).
I am wondering if there is a way to make this faster and more efficient, or maybe to rewrite it completely...
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe used below

# the main function: data is a dataframe, windowSize and ratio are ints
computeNewColumn <- function(data, windowSize, ratio) {
  # helper function used in the second mutate down...
  # all args are ints, I return a boolean out
  windowAhead <- function(timeTo, window, reduction) {
    # subset the original dataframe: only observations with values of
    # TimeToGo between timeTo-1 and window (basically the following X rows
    # from the current one)
    subframe <- data[(timeTo - 1 >= data$TimeToGo & data$TimeToGo >= window), ]
    isthere <- any(subframe$Price < reduction)
    return(isthere)
  }
  # I group by value of ID first and order by TimeToGo...
  data %<>% group_by(ID) %>%
    arrange(desc(TimeToGo)) %>%
    # ...create two new columns from simple interactions of existing ones...
    mutate(Window = ifelse(TimeToGo > windowSize, TimeToGo - windowSize, 0),
           Reduction = floor(Price - (ratio * Price))) %>%
    rowwise() %>%
    # ...now comes the more complex stuff: I want to compute a third column
    # depending on the next (TimeToGo - Window) number of values of Price
    mutate(Advice = ifelse(windowAhead(TimeToGo, Window, Reduction), 1, 0))
  return(data)
}
We have a dataset with the following columns: ID, Price, TimeToGo.
We first group by values of ID and compute two new columns based on current row values (Window from TimeToGo and Reduction from Price). The next thing we would like to do is compute a third new column based on
1. the current value of Reduction
2. the next (TimeToGo - Window) values of Price in the dataframe.
I am wondering if there is a simple way to reference upcoming values of a column from within mutate()? I am ideally looking for a sliding window function on one column, where the limits of the sliding window are set from two other current column values. My solution for now just uses a custom function which subsets on the original dataframe manually, does a comparison and returns back a value to the mutate() call. Any help and ideas would be much appreciated!
P.S. Here's a sample of the data... please let me know if you need any more info. Thanks!
> a
           ID TimeToGo Price
1 AQSAFOTO30A       96    19
2 AQSAFOTO20A       95    19
3 AQSAFOTO30A       94    17
4 AQSAFOTO20A       93    18
5 AQSAFOTO25A       92    19
6 AQSAFOTO30A       91    17
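One possible way to express the look-ahead window in base R is sketched below. It is an illustration under assumptions, not a tuned solution: the helper name advice_for_group is made up, windowSize = 5 and ratio = 0.05 are arbitrary, and the window is restricted to rows of the same ID (the code in the question subsets the whole data frame instead). It loops over every row of a group, so it shows the logic rather than a fast approach for 10+ million rows.

```r
# Hypothetical helper: computes Window, Reduction and Advice for one ID group
advice_for_group <- function(g, windowSize, ratio) {
  g <- g[order(-g$TimeToGo), ]
  g$Window <- ifelse(g$TimeToGo > windowSize, g$TimeToGo - windowSize, 0)
  g$Reduction <- floor(g$Price - ratio * g$Price)
  g$Advice <- vapply(seq_len(nrow(g)), function(i) {
    # rows "ahead" of row i: TimeToGo between Window[i] and TimeToGo[i] - 1
    ahead <- g$TimeToGo <= g$TimeToGo[i] - 1 & g$TimeToGo >= g$Window[i]
    # 1 if any of those rows has a Price below the current Reduction
    as.numeric(any(g$Price[ahead] < g$Reduction[i]))
  }, numeric(1))
  g
}

# Sample data from the question
a <- data.frame(
  ID = c("AQSAFOTO30A", "AQSAFOTO20A", "AQSAFOTO30A",
         "AQSAFOTO20A", "AQSAFOTO25A", "AQSAFOTO30A"),
  TimeToGo = c(96, 95, 94, 93, 92, 91),
  Price = c(19, 19, 17, 18, 19, 17)
)

# Apply the helper per ID and stack the results
res <- do.call(rbind, lapply(split(a, a$ID), advice_for_group,
                             windowSize = 5, ratio = 0.05))
print(res)
```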

Get the position of maximum value and the respective row element in a Data frame

I created a data frame named "data" that has 100 rows of names and corresponding ages (column names "NAMES" and "AGES"). Now I am trying to find the maximum age with the max() function, using
max(data[,"AGES"])
I get the maximum age, but I also want to get its position and the names of the people who have the maximum age. After getting the names of the people with the maximum age, I want to arrange them alphabetically. How do I do this?
I tried searching on the net, but wasn't successful in putting the different pieces together.
Let's first generate some demo data:
data<-data.frame(NAMES=replicate(100, paste(sample(letters, 8, replace=TRUE), collapse="")), AGES=sample(20:60, 100, replace=TRUE))
head(data)
     NAMES AGES
1 oepefudt   21
2 ibmuaemm   49
3 mkockaqu   23
4 whyzomna   59
5 omqqtbsz   35
6 qnbmjmuf   25
We can then find the rows that have the maximum age, extract their names, and finally sort them in alphabetical order in a single line:
sort(as.character(data$NAMES[data$AGES==max(data$AGES)]))
Or maybe more transparently:
# Find the maximum age
max.age<-max(data$AGES)
# Which rows have the maximum age value?
ind<-which(data$AGES==max.age)
# Extract the name using the ind from above
persons<-as.character(data$NAMES[ind])
# Sort the names
persons.sorted<-sort(persons)
persons.sorted
Would this help?
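Since the question also asks for the position, here is a small sketch on made-up data (four invented names and ages) showing the difference between which.max(), which returns only the first position of the maximum, and which() with a comparison, which returns all of them. It also sorts the matching rows by name while keeping the ages alongside.

```r
# Tiny invented example with a tie for the maximum age
data <- data.frame(
  NAMES = c("dana", "bob", "alice", "carl"),
  AGES  = c(60, 35, 60, 41),
  stringsAsFactors = FALSE
)

# Position of the first maximum only
first_pos <- which.max(data$AGES)

# Positions of every row that has the maximum age
pos <- which(data$AGES == max(data$AGES))

# Rows with the maximum age, sorted alphabetically by name
oldest <- data[pos, ]
oldest <- oldest[order(oldest$NAMES), ]
print(oldest)
```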
