R function finding dataframe rows with specific attributes - r

I currently have a large dataset for which I want to find total time spent at altitude and range of temperatures experienced.
An example dataset is provided:
time<-c(1,2,3,4,5,6,7,8,9,10)
height<-c(10,33,41,57,20,27,23,39,40,42)
temp<-c(37,33,14,12,35,34,32,28,26,24)
practicedf<-data.frame(time,height,temp)
I want to calculate the total time spent above 30 m (height) and range of temperatures experienced at these altitudes. However, in my actual dataset the sampling frequency has resulted in a series of datapoints that skip over 30 m (i.e. going from 28.001 to 32.02 and never actually stopping at 30). Therefore I wanted to create a code that documented all of the dataframe rows that are below 30 m and also each time there is a gap between dataframe rows greater than one (to account for times when the data is above 30 m and then returns below 30 m, i.e. 27.24, 32.7, 45.002, 28.54) so I know to discount all points above the altitude I am targeting.
I've created the following function to carry this portion of my analysis out (pinpointing dataframe rows below 30 m).
pracfunction<-function(h){
res<-as.vector(lapply(h,function(x) if (x<=30) {1} else {0}))
res1<-as.vector(which(res == 1))
res_new<-list()
for (item in 1:length(res1)){
ifelse((res1[i+1]-res1[i]>1), append(res_new,i),
append(res_new,"na"))
}
print(which(res_new != "na"))
}
I want the output to look like:
[1] 1 5 6 7
Since in the vector height, indices 1, 5, 6, and 7 have values less than 30.
However each time I run it with height as the input I receive integer(0) as the output. I'm pretty new at writing loops and functions so if anyone could provide input into what I'm doing wrong, or has a better way to approach this problem it would be greatly appreciated! Thank you.

I'd use dplyr to create a new column low indicating whether height < 30.
library(dplyr)
practicedf <- practicedf %>%
  mutate(low = ifelse(height < 30, 1, 0))
   time height temp low
1     1     10   37   1
2     2     33   33   0
3     3     41   14   0
4     4     57   12   0
5     5     20   35   1
6     6     27   34   1
7     7     23   32   1
8     8     39   28   0
9     9     40   26   0
10   10     42   24   0
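The quantities asked for in the question can be sketched in base R from the same threshold test. This is only a sketch: it assumes each row stands for one time unit (with irregular sampling you would sum time differences instead), and the names below_idx, above, time_above and temp_range are illustrative, not from the original post.

```r
# Example data from the question
time   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
height <- c(10, 33, 41, 57, 20, 27, 23, 39, 40, 42)
temp   <- c(37, 33, 14, 12, 35, 34, 32, 28, 26, 24)
practicedf <- data.frame(time, height, temp)

# Row indices below the 30 m threshold (the output the question asks for)
below_idx <- which(practicedf$height < 30)

# Logical flag for rows above the threshold
above <- practicedf$height > 30

# Assumption: one row = one time unit, so counting rows gives the time spent
time_above <- sum(above)

# Range of temperatures recorded while above the threshold
temp_range <- range(practicedf$temp[above])
```

With the example data, below_idx holds the rows 1, 5, 6 and 7 that the question lists as the desired output.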

Not sure whether I understand your intentions correctly but here is what I think you might be looking for. Start with an extended sample data.frame:
pd <- structure(list(time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), height = c(10, 33, 41, 57, 20,
27, 23, 39, 40, 42, 10, 33, 28, 17, 20, 27, 23, 39, 40, 42),
temp = c(37, 33, 14, 12, 35, 34, 32, 28, 26, 24, 37, 33,
14, 12, 35, 34, 32, 28, 26, 24)), .Names = c("time", "height",
"temp"), row.names = c(NA, -20L), class = "data.frame")
This function then returns the index of every row at which the height crosses the 30 m line, in either direction. I guess that's not exactly what you want, but you can take it from here.
pf <- function( x ) # x is the data.frame
{
res <- ifelse( x[ , "height" ] <= 30, 1 , 0 ) # simplified version of your attempt
n <- NULL # initiate the index vector
for( i in 1:( length( res ) - 1 ) ) # to accommodate room for comparison
{
if( res[ i + 1 ] != res[ i ] ) # registers change between 0 and 1
n <- append( n, i + 1 ) # and writes it into the index vector
}
return( n )
}
With this, the call
pf( pd )
returns
[1] 2 5 8 11 12 13 18
indicating the positions on the height vector after the height limit of 30m was crossed, in either direction.
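The same crossing positions can also be found without an explicit loop. The sketch below uses diff() on the below/above flag and, as an extra, rle() to measure how long each run lasts; the names heights, below, crossings and runs are illustrative, and the row count only equals time if samples are evenly spaced (an assumption).

```r
# Heights from the extended sample data
heights <- c(10, 33, 41, 57, 20, 27, 23, 39, 40, 42,
             10, 33, 28, 17, 20, 27, 23, 39, 40, 42)

# TRUE where the height is at or below the 30 m line
below <- heights <= 30

# A non-zero difference between neighbours marks a crossing;
# + 1 gives the index of the first row after the crossing
crossings <- which(diff(as.integer(below)) != 0) + 1

# Run-length encoding: consecutive stretches below (TRUE) or above (FALSE)
runs <- rle(below)

# Total number of rows spent above the line
rows_above <- sum(runs$lengths[!runs$values])
```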

Related

Examine if a value is in an interval using R

Having the following vector:
t <- c(2, 6, 8, 20, 22, 30, 40, 45, 60)
I would like to find the values that fall between the following intervals:
g <- list(c(1,20), c(20, 40))
The desired output is:
1, 20 c(2, 6, 8)
20, 40 c(20, 22, 30)
Using the dplyr library, I do the following:
library(dplyr)
for(i in t){
for(h in g){
if(between(i, h[[1]], h[[2]])==TRUE){print(c(i, h[[1]], h[[2]]))}
}}
Is there a better way of doing this in R?
We can loop over the list 'g' with lapply and, for each interval, build a logical vector with >= and < from its first and second values, then use it to extract the matching elements of 't'. Note that the upper bound is exclusive here.
lapply(g, function(x) t[t >= x[1] & t < x[2]])
-output
[[1]]
[1] 2 6 8
[[2]]
[1] 20 22 30
library(purrr)
library(dplyr)
map(g,~keep(t,between(t,.[1],.[2])))
[[1]]
[1] 2 6 8 20
[[2]]
[1] 20 22 30 40
You may find findInterval() from base R useful:
g <- c(1, 20, 40)
t <- c(2, 6, 8, 20, 22, 30, 40, 45, 60)
findInterval(t, g)
#> [1] 1 1 1 2 2 2 3 3 3
So t[1], t[2] and t[3] are in the first interval, t[4], t[5] and
t[6] in the second, and t[7], t[8] and t[9] in the third (meaning that
these values are greater than or equal to the right end point of the second interval).
If you had values lower than one they would be labelled by 0:
t2 <- c(-1, 0, 2, 6, 8, 20, 22, 30, 40, 45, 60)
findInterval(t2, g)
#> [1] 0 0 1 1 1 2 2 2 3 3 3
You can save the result of findInterval() as e.g. y and use which(y==1) to find which entries correspond to the first interval.
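As a small sketch of that last step, the interval labels can be passed to split() to get all groups at once. The names breaks, vals, bin and groups are illustrative; the data are the same as above.

```r
breaks <- c(1, 20, 40)
vals   <- c(2, 6, 8, 20, 22, 30, 40, 45, 60)

# Interval label for every value (left-closed, right-open intervals)
bin <- findInterval(vals, breaks)

# One list element per interval label
groups <- split(vals, bin)

# Values in the first interval only
first_interval <- vals[which(bin == 1)]
```

Here groups is a list named "1", "2" and "3"; the element "3" collects everything at or beyond the last break.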
We can try cut + is.na like below
lapply(
g,
function(x) {
t[!is.na(cut(t, x, include.lowest = TRUE))]
}
)
which gives
[[1]]
[1] 2 6 8 20
[[2]]
[1] 20 22 30 40

Replicate excel solver in R

I am wondering if there is a simple function to solve the following problem in R:
Suppose I have the following dataframe:
Variable 'A' with values c(10, 35, 90)
Variable 'B' with values c(3, 4, 17, 18, 50, 40, 3)
Now I know that the sum of various values in B equal the values in A, e.g. '3 + 4 + 3 = 10' and '17 + 18 = 35', which always balances out in the complete dataset.
Question
Is there a function that can sum these values in B, through trial and error I suppose, and match the correctly summed values with A? For example, the function tries to sum 3 + 4 + 18, which is 25 and retries this because 25 is not a value in A.
I have tried several solutions myself, but one problem I often ran into is that A always has fewer observations than B.
I would be very thankful if someone can help me out here! If more info is needed please let me know.
Cheers,
Daan
Edit
This example is with simplified numbers. In reality, it is a large dataset, so I am looking for a scalable solution.
Thanks again!
This is known as the subset sum problem, and there are plenty of examples online of how to solve it using dynamic programming or greedy algorithms.
To give you an answer that just works, the package adagio has an implementation:
library(adagio)
sums = c(10, 35, 90)
values = c(3, 4, 17, 18, 50, 40, 3)
for(i in sums){
  #we have to subset the values to be less than the value
  #otherwise the function errors:
  print(subsetsum(values[values < i], i))
}
The output for each sum is a list, with the val and the indices in the array, so you can tidy up the output depending on what you want from there.
You can try the following, but I am afraid it is not scalable.
For the case of 3 summands you have
x <- expand.grid(c(3, 4, 17, 18, 50, 40, 3),#building a matrix of the possible combinations of summands
c(3, 4, 17, 18, 50, 40, 3),
c(3, 4, 17, 18, 50, 40, 3))
x$sums <-rowSums(x) #new column with possible sums
idx<- x$sums%in%c(10, 35, 90) #checking the sums are in the required total
x[idx,]
    Var1 Var2 Var3 sums
2      4    3    3   10
8      3    4    3   10
14     3    4    3   10
44     4    3    3   10
50     3    3    4   10
56     3    3    4   10
92     3    3    4   10
98     3    3    4   10
296    4    3    3   10
302    3    4    3   10
308    3    4    3   10
338    4    3    3   10
For the case of 2 summands
x <- expand.grid(c(3, 4, 17, 18, 50, 40, 3),
c(3, 4, 17, 18, 50, 40,3))
x$sums <-rowSums(x)
idx<- x$sums%in%c(10, 35, 90)
#Results
x[idx,]
   Var1 Var2 sums
18   18   17   35
24   17   18   35
34   40   50   90
40   50   40   90
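Note that expand.grid() lets the same element be picked more than once and lists every ordering separately. A variant that uses each position in B at most once can be sketched with combn(). The helper name find_subsets is hypothetical, and the search is still exponential in the length of B, so this is for small inputs only.

```r
# Hypothetical helper: all sets of positions in `values` whose elements sum to `target`
find_subsets <- function(values, target) {
  hits <- list()
  for (k in seq_along(values)) {
    # each column of idx is one way to choose k positions
    idx  <- combn(length(values), k)
    sums <- colSums(matrix(values[idx], nrow = k))
    for (j in which(sums == target)) {
      hits[[length(hits) + 1]] <- idx[, j]
    }
  }
  hits
}

values <- c(3, 4, 17, 18, 50, 40, 3)

pos_10 <- find_subsets(values, 10)  # positions summing to 10
pos_35 <- find_subsets(values, 35)  # positions summing to 35
pos_90 <- find_subsets(values, 90)  # positions summing to 90
```

Each result is a list of integer position vectors, so values[pos_10[[1]]] recovers the summands. The equality test sums == target is exact, which is fine for whole numbers but would need a tolerance for decimals.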

Formula to substitute dataframe column names with categories defined in a second dataframe

Let's say I have data in wide format (samples in row and species in columns).
species <- data.frame(
Sample = 1:10,
Lobvar = c(21, 15, 12, 11, 32, 42, 54, 10, 1, 2),
Limtru = c(2, 5, 1, 0, 2, 22, 3, 0, 1, 2),
Pocele = c(3, 52, 11, 30, 22, 22, 23, 10, 21, 32),
Genmes = c(1, 0, 22, 1, 2,32, 2, 0, 1, 2)
)
And I want to automatically change the species names, based on a reference of functional groups that I have for all of the species (so it works even if I have more references than actual species in the dataset), for example:
reference <- data.frame(
Species_name = c("Lobvar", "Ampmis", "Pocele", "Genmes", "Limtru", "Secgio", "Nasval", "Letgos", "Salnes", "Verbes"),
Functional_group = c("Crustose", "Geniculate", "Erect", "CCA", "CCA", "CCA", "Geniculate", "Turf","Turf", "Crustose"),
stringsAsFactors = FALSE
)
EDIT
Thanks to @Dan Y's suggestion, I can now change the species names to their functional group names:
names(species)[2:ncol(species)] <- reference$Functional_group[match(names(species), reference$Species_name)][-1]
However, in my actual data.frame I have more species, and this creates many functional groups with the same name in different columns. I would now like to sum the columns that have the same name. I updated the example to give a result in which more than one functional group has the same name.
So I get this:
Sample Crustose CCA Erect CCA Crustose
1 21 2 3 1 2
2 15 5 52 0 3
3 12 1 11 22 4
4 11 0 30 1 1
5 32 2 22 2 0
6 42 22 22 32 0
and the final result I am looking for is this:
Sample Crustose CCA Erect
1 23 3 3
2 18 5 52
3 16 22 11
4 12 1 30
5 32 4 22
6 42 54 22
How do you advise on approaching this? Thanks for your help and the amazing suggestions I already received.
Re Q1) We can use match to do the name lookup:
names(species)[2:ncol(species)] <- reference$Functional_group[match(names(species), reference$Species_name)][-1]
Re Q2) Then we can mapply the rowSums function after some regular expression work on the colnames:
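# `species` is the renamed data frame from Q1; strip any ".1"-style suffixes from duplicated names
namevec <- gsub("\\.[[:digit:]]", "", names(species))
# for each unique name, sum the columns that carry it
mapply(function(x) rowSums(species[which(namevec == x)]), unique(namevec))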
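An alternative sketch that skips the renaming step: look up the group of each species column, split the columns by group, and sum within each group. It assumes every species column has a match in the reference table (unmatched columns would be dropped silently); the names groups, summed and result are illustrative.

```r
species <- data.frame(
  Sample = 1:10,
  Lobvar = c(21, 15, 12, 11, 32, 42, 54, 10, 1, 2),
  Limtru = c(2, 5, 1, 0, 2, 22, 3, 0, 1, 2),
  Pocele = c(3, 52, 11, 30, 22, 22, 23, 10, 21, 32),
  Genmes = c(1, 0, 22, 1, 2, 32, 2, 0, 1, 2)
)
reference <- data.frame(
  Species_name = c("Lobvar", "Ampmis", "Pocele", "Genmes", "Limtru",
                   "Secgio", "Nasval", "Letgos", "Salnes", "Verbes"),
  Functional_group = c("Crustose", "Geniculate", "Erect", "CCA", "CCA",
                       "CCA", "Geniculate", "Turf", "Turf", "Crustose"),
  stringsAsFactors = FALSE
)

# Functional group of each species column (Sample column excluded)
groups <- reference$Functional_group[match(names(species)[-1],
                                           reference$Species_name)]

# Split the columns by group, then sum across the columns of each group
summed <- sapply(split.default(species[-1], groups), rowSums)

result <- data.frame(Sample = species$Sample, summed)
```

The group columns come out in alphabetical order (CCA, Crustose, Erect) because split() sorts the group labels.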

Subset dataframe with equal difference for one column in R

I am trying to iterate over the rows of a dataframe (data) to check whether one of the columns (data$ID) has a similar difference (e.g., 3) between consecutive elements. If yes, keep the row; otherwise remove it. The tricky part is that I need to re-compare consecutive elements after a row is removed.
data <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
data
ID score
1 3.1 70
2 6 80
3 6.9 90
4 9 65
5 10.5 43
6 12 78
7 14.2 44
8 15 92
for (i in (length(data$ID)-1)) {
first <- data$ID[i]
second <- data$ID[i+1]
if ((second-first) == 3){
data <- data[-(i+1),]
}
}
The expected output data should be
ID score
1 3.1 70
2 6 80
3 9 65
4 12 78
5 15 92
The initial rows 3, 5 and 7 are excluded because their differences don't match. But my code failed.
I also tried the diff function,
DF <- diff(data)
But it doesn't account for the fact that once a row is removed, the differences change. Should I use the diff function in a loop, even though the data frame changes dynamically?
Using a recursive function (a function that calls itself)
data <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
# use recursive function to trim the remainder of the list
trim_ids <- function (ids) {
  # if only one element, return it
  if (length(ids) <= 1) {
    return(ids)
  }
  # if the gap between element 2 and element 1 is small enough
  if ((ids[2] - ids[1]) < 2.9 ) {
    # trim after dropping the second element
    return(trim_ids(ids[-2]))
  } else {
    # keep the first element and trim from the second element
    return(c(ids[1], trim_ids(ids[2:length(ids)] )))
  }
}
# find the ids to keep
keep_ids <- trim_ids(data$ID)
# select the matching rows
data[data$ID %in% keep_ids,]
# ID score
# 1 3.1 70
# 2 6.0 80
# 4 9.0 65
# 6 12.0 78
# 8 15.0 92
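The same rule can also be written as a plain loop that remembers the last kept ID, which avoids deep recursion on long vectors. This is a sketch under assumptions: step and tol are hypothetical parameters (a gap counts as "about 3" when it is at least step - tol), and the data frame is assumed to have at least one row.

```r
data <- data.frame(ID = c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
                   score = c(70, 80, 90, 65, 43, 78, 44, 92))

step <- 3     # target gap between kept IDs (assumed)
tol  <- 0.15  # hypothetical tolerance around the target gap

keep <- logical(nrow(data))
keep[1] <- TRUE          # always keep the first row
last_id <- data$ID[1]    # ID of the most recently kept row

for (i in seq_len(nrow(data))[-1]) {
  # compare against the last KEPT row, not the previous row
  if (data$ID[i] - last_id >= step - tol) {
    keep[i] <- TRUE
    last_id <- data$ID[i]
  }
}

trimmed <- data[keep, ]
```

Comparing against the last kept row is what handles the "re-compare after removal" requirement from the question.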
An option could be achieved using cumsum and diff as:
#data
data <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
data[c(0, cumsum(diff(round(data$ID))) %% 3 ) == 0,]
# ID score
# 1 3.1 70
# 2 6.0 80
# 4 9.0 65
# 6 12.0 78
# 8 15.0 92
If you define the rule as keeping every row whose ID, rounded to 0 digits, is a multiple of 3, you could try:
df1 <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
df1[round(df1$ID) %% 3 == 0,]
ID score
1 3.1 70
2 6.0 80
4 9.0 65
6 12.0 78
8 15.0 92

Data analysis in R: how to describe the distribution of NA positions in a vector?

I want the positions of the NA values in a vector (length = 30, fewer than 6 NA values) to be uniformly distributed.
This one has length 30 and 4 NA values. It's easy to see that the NA values are not uniformly spread; they sit mainly on the left.
vector_x <- c(NA,3, NA, 1, NA, 5, 6, 7, 7, 9, 0, 2, 12, 324, 54,23, 12, 324, 122, 23, 324, 332, 45, 78, 32, 12, 342, 95, 67, NA)
But I don't know which statistic or test to use to describe this, so that I can screen vectors quantitatively with a cutoff.
So far I have two preliminary ideas:
To simplify the problem, treat every NA as 0 and every number as 1, then look at the distribution.
Or take the indices of the NA values and do a variance analysis on c(1, 3, 5, 30).
Thanks for any suggestions!
You want to perform a Mann-Whitney U test or Wilcoxon rank-sum test (which is more descriptive of what it's doing)
This is easy to do with your data (v below is your vector_x):
which(is.na(v))
# [1] 1 3 5 30
which(!is.na(v))
# [1] 2 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
wilcox.test(which(is.na(v)), which(!is.na(v)))
# Wilcoxon rank sum test
# data: which(is.na(v)) and which(!is.na(v))
# W = 29, p-value = 0.1766
# alternative hypothesis: true location shift is not equal to 0
Check that wilcox.test works the way we expect with
wilcox.test(1:5, 6:10) # low p value
wilcox.test(seq(1,10,2), seq(2,10,2)) # high p value
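For screening many vectors against a cutoff, the test can be wrapped in a small function. This is only a sketch: the helper name na_position_test and the 0.05 cutoff are assumptions, not part of the answer above.

```r
vector_x <- c(NA, 3, NA, 1, NA, 5, 6, 7, 7, 9, 0, 2, 12, 324, 54, 23, 12,
              324, 122, 23, 324, 332, 45, 78, 32, 12, 342, 95, 67, NA)

# Hypothetical helper: rank-sum test of NA positions against non-NA positions
na_position_test <- function(x) {
  wilcox.test(which(is.na(x)), which(!is.na(x)))
}

res <- na_position_test(vector_x)

# Hypothetical screening rule: flag the vector if the p-value is below the cutoff
cutoff  <- 0.05
flagged <- res$p.value < cutoff
```

With only a handful of NA values the test has little power, so a vector that is not flagged is not thereby shown to be uniform.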
If we need the indices of the NA elements, use is.na to get a logical vector; which then returns the numeric indices where it is TRUE
which(is.na(vector_x))
#[1] 1 3 5 30
Or convert to a binary vector in which 0 represents NA and 1 any other value
as.integer(!is.na(vector_x))
