It seems pretty basic... but I'm trying to generate a second array in R that would correspond to the counts of the events of my primary array. For instance, if there are 14 Age[x] that are 42, I would want Age.count[x] to equal 14.
So if Age was [1] 10 14 14 13 14 12 10 I would want my Age.count to be [1] 2 3 3 1 3 1 2. It seems like it should be really simple but I haven't managed yet...
My best shot so far:
for (val in length(Age)) {
Age.count[val] <- length(subset(Age, Age==val2))
}
Unfortunately it's giving me NA values on all but the first and last values. Help?
A simple way is to use ave, i.e.,
> ave(age,age,FUN = length)
[1] 2 3 3 1 3 1 2
DATA
age <- c(10, 14, 14, 13, 14, 12, 10)
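For comparison, the loop from the question can also be repaired directly. This is only a sketch of one possible fix: it assumes the intent was to visit every position of Age (seq_along(Age) rather than length(Age), which is just the single number 7) and to count how many elements equal the value at that position.

```r
Age <- c(10, 14, 14, 13, 14, 12, 10)

# Pre-allocate the result so every position gets a value
Age.count <- integer(length(Age))

# Loop over positions 1..length(Age), not over the single number length(Age)
for (i in seq_along(Age)) {
  # Count how many elements of Age equal the value at position i
  Age.count[i] <- sum(Age == Age[i])
}

Age.count
```

The loop compares the whole vector against every element, so ave() or table() will generally be the better choice on long vectors.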
You could make it more compact than this, but at least you can see what is happening this way.
age = c(10, 14, 14, 13, 14, 12, 10)
counts = table(age)
i = match(age, names(counts))
counts[i]
> counts[i]
age
10 14 14 13 14 12 10
2 3 3 1 3 1 2
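The more compact form mentioned above might look like the following sketch: index the table by the character form of each value, then drop the names. It assumes as.character() renders each value exactly as table() labels it, which holds for whole numbers like these but is worth checking for non-integer data.

```r
age <- c(10, 14, 14, 13, 14, 12, 10)

# table() labels its cells with the values as strings ("10", "12", ...),
# so looking up as.character(age) returns one count per original element
age_count <- as.vector(table(age)[as.character(age)])

age_count
```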
Let
Age = c(10, 14, 14, 13, 14, 12, 10)
X = data.frame(Age)
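Test = as.data.frame(table(Age = X$Age))  # name the dimension explicitly so the column is always "Age"
Test$Age = as.numeric(as.character(Test$Age))  # convert the factor levels back to numbers
colnames(Test) = c("Age", "Frequency")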
Then
Result = dplyr::inner_join(X, Test)
will work.
I am wondering if there is a simple function to solve the following problem in R:
Suppose I have the following dataframe:
Variable 'A' with values c(10, 35, 90)
Variable 'B' with values c(3, 4, 17, 18, 50, 40, 3)
Now I know that the sum of various values in B equal the values in A, e.g. '3 + 4 + 3 = 10' and '17 + 18 = 35', which always balances out in the complete dataset.
Question
Is there a function that can sum these values in B, through trial and error I suppose, and match the correctly summed values with A? For example, the function tries to sum 3 + 4 + 18, which is 25 and retries this because 25 is not a value in A.
I have tried several solutions myself, but one problem that I often encountered was that A always has fewer observations than B.
I would be very thankful if someone can help me out here! If more info is needed please let me know.
Cheers,
Daan
Edit
This example is with simplified numbers. In reality, it is a large dataset, so I am looking for a scalable solution.
Thanks again!
This is known as the subset sum problem, and there are many examples online of how to solve it using dynamic programming or greedy algorithms.
To give you an answer that just works, the package adagio has an implementation:
library(adagio)
sums = c(10, 35, 90)
values = c(3, 4, 17, 18, 50, 40, 3)
for(i in sums){
#we have to subset the values to be less than the value
#otherwise the function errors:
print(subsetsum(values[values < i], i))
}
The output for each sum is a list, with the val and the indices in the array, so you can tidy up the output depending on what you want from there.
You can try the following, but I am afraid it is not scalable.
For the case of 3 summands you have
x <- expand.grid(c(3, 4, 17, 18, 50, 40, 3),#building a matrix of the possible combinations of summands
c(3, 4, 17, 18, 50, 40, 3),
c(3, 4, 17, 18, 50, 40, 3))
x$sums <-rowSums(x) #new column with possible sums
idx<- x$sums%in%c(10, 35, 90) #checking the sums are in the required total
x[idx,]
Var1 Var2 Var3 sums
2 4 3 3 10
8 3 4 3 10
14 3 4 3 10
44 4 3 3 10
50 3 3 4 10
56 3 3 4 10
92 3 3 4 10
98 3 3 4 10
296 4 3 3 10
302 3 4 3 10
308 3 4 3 10
338 4 3 3 10
For the case of 2 summands
x <- expand.grid(c(3, 4, 17, 18, 50, 40, 3),
c(3, 4, 17, 18, 50, 40,3))
x$sums <-rowSums(x)
idx<- x$sums%in%c(10, 35, 90)
#Results
x[idx,]
Var1 Var2 sums
18 18 17 35
24 17 18 35
34 40 50 90
40 50 40 90
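A variation on the brute-force idea, sketched in base R: enumerating combinations of positions with combn() means no element of B is reused and reorderings of the same subset are not repeated. The helper name find_subsets is made up for this illustration, and like the expand.grid version it checks every subset, so it is only practical for short vectors.

```r
values  <- c(3, 4, 17, 18, 50, 40, 3)
targets <- c(10, 35, 90)

# Hypothetical helper: return every set of positions in `values`
# whose elements add up to `target` (exhaustive search, 2^n subsets)
find_subsets <- function(values, target) {
  hits <- list()
  for (k in seq_along(values)) {
    idx <- combn(length(values), k)               # each column: k positions
    sums <- colSums(matrix(values[idx], nrow = k))
    for (j in which(sums == target)) {
      hits[[length(hits) + 1]] <- idx[, j]
    }
  }
  hits
}

res <- lapply(targets, find_subsets, values = values)
names(res) <- targets
res
```

For this data each target has exactly one matching subset: positions 1, 2, 7 for 10, positions 3, 4 for 35, and positions 5, 6 for 90. The exact comparison sums == target is safe here because the values are whole numbers; with decimals a tolerance would be needed.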
Let's say I have data in wide format (samples in row and species in columns).
species <- data.frame(
Sample = 1:10,
Lobvar = c(21, 15, 12, 11, 32, 42, 54, 10, 1, 2),
Limtru = c(2, 5, 1, 0, 2, 22, 3, 0, 1, 2),
Pocele = c(3, 52, 11, 30, 22, 22, 23, 10, 21, 32),
Genmes = c(1, 0, 22, 1, 2,32, 2, 0, 1, 2)
)
And I want to automatically change the species names, based on a reference of functional groups that I have for all of the species (so it works even if I have more references than actual species in the dataset), for example:
reference <- data.frame(
Species_name = c("Lobvar", "Ampmis", "Pocele", "Genmes", "Limtru", "Secgio", "Nasval", "Letgos", "Salnes", "Verbes"),
Functional_group = c("Crustose", "Geniculate", "Erect", "CCA", "CCA", "CCA", "Geniculate", "Turf","Turf", "Crustose"),
stringsAsFactors = FALSE
)
EDIT
Thanks to @Dan Y's suggestions, I can now change the species names to their functional group names:
names(species)[2:ncol(species)] <- reference$Functional_group[match(names(species), reference$Species_name)][-1]
However, in my actual data.frame I have more species, and this creates many functional groups with the same name in different columns. I would now like to sum the columns that have the same name. I updated the example to give a result in which there is more than one functional group with the same name.
So I get this:
Sample Crustose CCA Erect CCA Crustose
1 21 2 3 1 2
2 15 5 52 0 3
3 12 1 11 22 4
4 11 0 30 1 1
5 32 2 22 2 0
6 42 22 22 32 0
and the final result I am looking for is this:
Sample Crustose CCA Erect
1 23 3 3
2 18 5 52
3 16 22 11
4 12 1 30
5 32 4 22
6 42 54 22
How do you advise on approaching this? Thanks for your help and the amazing suggestions I already received.
Re Q1) We can use match to do the name lookup:
names(species)[2:ncol(species)] <- reference$Functional_group[match(names(species), reference$Species_name)][-1]
Re Q2) Then we can mapply the rowSums function after some regular expression work on the colnames:
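namevec <- gsub("\\.[[:digit:]]+$", "", names(species))  # strip any ".1", ".2" suffixes added to duplicated names
mapply(function(x) rowSums(species[which(namevec == x)]), unique(namevec))  # one summed column per unique name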
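An alternative sketch for Q2 that skips the renaming step: look up the group of each species column, split the columns by group with split.default(), and sum within each group. The reference table is shortened to the four species in the example, and the names counts, groups, summed and result are illustrative only.

```r
species <- data.frame(
  Sample = 1:10,
  Lobvar = c(21, 15, 12, 11, 32, 42, 54, 10, 1, 2),
  Limtru = c(2, 5, 1, 0, 2, 22, 3, 0, 1, 2),
  Pocele = c(3, 52, 11, 30, 22, 22, 23, 10, 21, 32),
  Genmes = c(1, 0, 22, 1, 2, 32, 2, 0, 1, 2)
)
reference <- data.frame(
  Species_name = c("Lobvar", "Pocele", "Genmes", "Limtru"),
  Functional_group = c("Crustose", "Erect", "CCA", "CCA"),
  stringsAsFactors = FALSE
)

counts <- species[-1]  # species columns only, without Sample
groups <- reference$Functional_group[match(names(counts), reference$Species_name)]

# split.default() on a data.frame splits its columns by the grouping vector;
# rowSums() then adds up the columns within each functional group
summed <- sapply(split.default(counts, groups), rowSums)
result <- data.frame(Sample = species$Sample, summed)
result
```

The group columns come out in alphabetical order (CCA, Crustose, Erect) because the grouping vector is turned into a factor.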
I apologize for the poor phrasing of this question, I am still a beginner in R and I am still getting used to the proper terminology. I have provided sample data below:
mydata <- data.frame(x = c(1, 2, 7, 19, 45), y=c(10, 12, 15, 19, 24))
View(mydata)
My intention is to find the x speed, and for this I would need to find the difference between 1 and 2, 2 and 7, 7 and 19, and so on. How would I do this?
You can use the diff function.
> diffs <- as.data.frame(diff(as.matrix(mydata)))
> diffs
x y
1 1 2
2 5 3
3 12 4
4 26 5
> mean(diffs$x)
[1] 11
You can use dplyr::lead() and dplyr::lag(), depending on how you want the calculations to line up:
library(dplyr)
mydata <- data.frame(x = c(1, 2, 7, 19, 45), y=c(10, 12, 15, 19, 24))
View(mydata)
mydata %>%
mutate(x_speed_diff_lead = lead(x) - x
, x_speed_diff_lag = x - lag(x))
# x y x_speed_diff_lead x_speed_diff_lag
# 1 1 10 1 NA
# 2 2 12 5 1
# 3 7 15 12 5
# 4 19 19 26 12
# 5 45 24 NA 26
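If you would rather not load dplyr, the same two columns can be sketched in base R by padding diff() with an NA at the end or at the start. The column names here are just placeholders chosen for the illustration.

```r
mydata <- data.frame(x = c(1, 2, 7, 19, 45), y = c(10, 12, 15, 19, 24))

# Difference to the next row (last row has no successor, hence NA)
mydata$x_diff_lead <- c(diff(mydata$x), NA)

# Difference from the previous row (first row has no predecessor, hence NA)
mydata$x_diff_lag <- c(NA, diff(mydata$x))

mydata
```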
I currently have a large dataset for which I want to find total time spent at altitude and range of temperatures experienced.
An example dataset is provided:
time<-c(1,2,3,4,5,6,7,8,9,10)
height<-c(10,33,41,57,20,27,23,39,40,42)
temp<-c(37,33,14,12,35,34,32,28,26,24)
practicedf<-data.frame(time,height,temp)
I want to calculate the total time spent above 30 m (height) and the range of temperatures experienced at these altitudes. However, in my actual dataset the sampling frequency means the datapoints skip over 30 m (e.g. going from 28.001 to 32.02 without ever stopping at 30). Therefore I wanted to write code that records all of the dataframe rows that are below 30 m, and also each place where the gap between consecutive below-30 m row indices is greater than one (to account for times when the data goes above 30 m and then returns below 30 m, e.g. 27.24, 32.7, 45.002, 28.54), so I know to discount all points above the altitude I am targeting.
I've created the following function to carry this portion of my analysis out (pinpointing dataframe rows below 30 m).
pracfunction<-function(h){
res<-as.vector(lapply(h,function(x) if (x<=30) {1} else {0}))
res1<-as.vector(which(res == 1))
res_new<-list()
for (item in 1:length(res1)){
ifelse((res1[i+1]-res1[i]>1), append(res_new,i),
append(res_new,"na"))
}
print(which(res_new != "na"))
}
I want the output to look like:
[1] 1 5 6 7
Since in the vector height, indices 1, 5, 6, and 7 have values less than 30.
However each time I run it with height as the input I receive integer(0) as the output. I'm pretty new at writing loops and functions so if anyone could provide input into what I'm doing wrong, or has a better way to approach this problem it would be greatly appreciated! Thank you.
I'd use dplyr to create a new column low indicating whether height < 30.
library(dplyr)
practicedf <- practicedf %>%
mutate(low = ifelse(height < 30, 1, 0))
time height temp low
1 1 10 37 1
2 2 33 33 0
3 3 41 14 0
4 4 57 12 0
5 5 20 35 1
6 6 27 34 1
7 7 23 32 1
8 8 39 28 0
9 9 40 26 0
10 10 42 24 0
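From a flag like this, the quantities in the question can be sketched with plain logical indexing. The time total below assumes each row stands for one unit of time, which is true of the example data but may not hold for the real dataset; the variable names are illustrative.

```r
time   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
height <- c(10, 33, 41, 57, 20, 27, 23, 39, 40, 42)
temp   <- c(37, 33, 14, 12, 35, 34, 32, 28, 26, 24)
practicedf <- data.frame(time, height, temp)

above <- practicedf$height > 30

rows_below <- which(!above)                  # rows at or below 30 m
time_above <- sum(above)                     # assumption: one time unit per row
temp_range <- range(practicedf$temp[above])  # min and max temperature above 30 m

rows_below
time_above
temp_range
```

For the example data this gives rows 1, 5, 6, 7 below the threshold, 6 time units above it, and a temperature range of 12 to 33.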
Not sure whether I understand your intentions correctly but here is what I think you might be looking for. Start with an extended sample data.frame:
pd <- structure(list(time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), height = c(10, 33, 41, 57, 20,
27, 23, 39, 40, 42, 10, 33, 28, 17, 20, 27, 23, 39, 40, 42),
temp = c(37, 33, 14, 12, 35, 34, 32, 28, 26, 24, 37, 33,
14, 12, 35, 34, 32, 28, 26, 24)), .Names = c("time", "height",
"temp"), row.names = c(NA, -20L), class = "data.frame")
Then this function returns the indices at which the height crosses the 30 m line, in either direction. I guess that's not exactly what you want, but you can take it from here.
pf <- function( x ) # x is the data.frame
{
res <- ifelse( x[ , "height" ] <= 30, 1 , 0 ) # simplified version of your attempt
n <- NULL # initiate the index vector
for( i in 1:( length( res ) - 1 ) ) # to accommodate room for comparison
{
if( res[ i + 1 ] != res[ i ] ) # registers change between 0 and 1
n <- append( n, i + 1 ) # and writes it into the index vector
}
return( n )
}
With this, the call
pf( pd )
returns
[1] 2 5 8 11 12 13 18
indicating the positions on the height vector after the height limit of 30m was crossed, in either direction.
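The same crossing positions can also be found without an explicit loop. This sketch applies diff() to the below/above flag and keeps the places where it changes; it works on the bare height vector rather than the data.frame.

```r
height <- c(10, 33, 41, 57, 20, 27, 23, 39, 40, 42,
            10, 33, 28, 17, 20, 27, 23, 39, 40, 42)

below <- height <= 30

# diff() on a logical vector gives -1, 0 or 1; any non-zero entry marks a
# change of side, and + 1 shifts to the first index after the crossing
crossings <- which(diff(below) != 0) + 1

crossings
```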
Given a vector of numbers, I'd like to map each to the smallest in a separate vector that the number does not exceed. For example:
# Given these
v1 <- 1:10
v2 <- c(2, 5, 11)
# I'd like to return
result <- c(2, 2, 5, 5, 5, 11, 11, 11, 11, 11)
Try
cut(v1, c(0, v2), labels = v2)
[1] 2 2 5 5 5 11 11 11 11 11
Levels: 2 5 11
which can be converted to a numeric vector using as.numeric(as.character(...)).
Another way (thanks for the edit, @Ananda):
v2[findInterval(v1, v2 + 1) + 1]
# [1] 2 2 5 5 5 11 11 11 11 11
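As a slower but very literal cross-check of either approach, the rule can be written out element by element: for each number, take the smallest value in v2 that is at least as large. This sketch assumes every element of v1 is no larger than max(v2); otherwise min() is called on an empty vector and returns Inf with a warning.

```r
v1 <- 1:10
v2 <- c(2, 5, 11)

# For each x, keep the candidates in v2 that x does not exceed, then take the smallest
result <- sapply(v1, function(x) min(v2[v2 >= x]))

result
```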