Select rows in data frame with same values - r

I have a dataframe with a column $Number whose values identify specific points where polygons intersect. Some points (e.g. 56) have three intersecting polygons. I want to extract the three rows whose $Number starts with 56.
df <- cbind(Number = rownames(check), check)
df
The issue going forward is that I will be applying this to 10,000 points and won't know the repeated number, such as "56", in advance. Is there a general expression that selects the matching rows without knowing that value?

You can achieve the desired output with:
subset2 <- function(n) df[floor(df$Number) == n,]
where df is the name of your dataset and Number is the name of the target column. We can fill in n as needed:
#Example
df <- data.frame(Number=c(1,3,24,56.65,56.99,56.14,66),y=sample(LETTERS,7))
df
# Number y
# 1 1.00 J
# 2 3.00 B
# 3 24.00 D
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H
# 7 66.00 V
subset2(56)
# Number y
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H

I simply converted the $Number column to numeric, then rounded it down to integer values:
numeric <- as.numeric(as.character(df$Number))
Id <- floor(numeric)
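Since with 10,000 points the repeated values aren't known in advance, the same idea generalizes: split on the floored id to recover every group at once. A minimal sketch using the example df above:
# one data frame per integer part of Number
groups <- split(df, floor(as.numeric(as.character(df$Number))))
groups[["56"]] # the three rows whose Number starts with 56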

If we only want values of $Number that occur at least 3 times, we can use dplyr to group by $Number and retain the groups with 3 or more rows
library(dplyr)
# Data
df <- data.frame(Number = c(1,1,1,2,2,3,3))
# Filtering
df %>% group_by(Number) %>% filter(n() >= 3)
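For completeness, a base R equivalent (a sketch on the same toy data) keeps rows whose Number occurs three or more times:
# table() counts each Number; keep rows whose value appears >= 3 times
df[df$Number %in% names(which(table(df$Number) >= 3)), ]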

Related

How can I apply the decile cuts from one dataframe to another using R

I have a dataframe (df1) and have calculated the deciles for each row using the following:
#create a function to calculate the deciles
decilefun <- function(x) as.integer(cut(x, unique(quantile(x, probs=0:10/10)), include.lowest=TRUE))
# convert df1 to matrix
mat1 <- as.matrix(df1)
#apply the function I created above to calculate deciles
df1_deciles <- apply(mat1, 1, decilefun)
#add the rownames back in
rownames(df1_deciles) <- row.names(df1)
#convert to dataframe
df1_deciles <- as.data.frame(df1_deciles)
str(df1_deciles) # to show what the data looks like
#'data.frame': 157 obs. of 3321 variables:
# $ Variable1 : int 10 10 4 4 5 8 8 8 6 3 ...
# $ Variable2 : int 8 3 9 7 2 8 9 5 8 2 ...
# $ Variable3 : int 8 4 7 7 2 9 10 3 8 3 ...
I have another dataframe (df2) with the same rownames (Variable1, Variable2, etc.) but a different number of columns.
I would like to use the same decile cuts which were used for df1 on this second dataframe, but I'm not sure how to do it. I am actually not even sure how to determine/export what the cuts were on the original data which resulted in the df1_deciles dataframe I created. What I mean is: how do I export an object which tells me what range of values for Variable1 in df1 was assigned a decile value of 1, a decile value of 2, and so on?
I do not want to use the 'decilefun' function I created on df2; instead I want to use the variability and range information from df1.
This is my first question on the platform, so I hope it is clear and I have provided enough information. I have tried to find answers on the platform but have not found one. I appreciate any help.
Using data.table:
##
# create an artificial dataset with the structure you describe
#
set.seed(1)
df1 <- data.frame(Variable.1=rnorm(1000), variable.2=runif(1000), variable.3=rgamma(1000, scale=10, shape=5))
df1 <- t(df1)
##
#
df2 <- data.frame(Variable.1=rnorm(1000, -1), variable.2=runif(1000), variable.3=rgamma(1000, scale=20, shape=5))
df2 <- t(df2)
##
# you start here
# assumes df1 and df2 have structure described in problem
# data in rows, not columns
#
library(data.table)
df1 <- as.data.table(t(df1)) # transpose: put data in columns
brks <- lapply(df1, quantile, probs=(0:10)/10) # named decile boundaries for each variable of df1
df2 <- as.data.table(df2, keep.rownames = TRUE) # keep df2 data in rows: 1000 columns here
result <- df2[ # this does all the work
  , .(value = unlist(.SD),
      decile = cut(unlist(.SD), breaks = c(-Inf, brks[[rn]], +Inf),
                   labels = c('below', names(brks[[rn]])[2:11], 'above')))
  , by = .(rn)]
result[, .N, keyby=.(rn, decile)] # validate that result is reasonable
Applying deciles from one dataset to another has the nuance that some values in the new dataset might be outside the range of the original data. The test data here demonstrates this problem: Variable.1 in df2 has values lower than any in df1, and variable.3 in df2 has values larger than any in df1.
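As an aside, the brks list built above is also the exportable object the question asks for: for each variable it records the decile boundaries computed from df1. For example (variable names follow the artificial data above):
brks[["Variable.1"]] # the 0%, 10%, ..., 100% quantiles of Variable.1 in df1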

Filtering/subsetting R dataframe based on each rows n'th position value

I have a 'df' with 2 columns:
Combinations <- c(0011111111, 0011113111, 0013113112, 0022223114)
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)
I am trying to find a way to subset or filter the dataframe where the 'Combinations' column's 7th, 8th, and 9th digits equal 311. For the example given, I would expect Combinations 0011113111, 0013113112, and 0022223114.
There are also instances where I would need to find different combinations, in different nth positions.
I know substring() can find these values for single rows but I'm not sure how to apply it to an entire dataframe.
substring will work with vectors as well.
subset(df, substring(Combinations, 7, 9) == 311)
# Combinations Values
#2 0011113111 2
#3 0013113112 3
#4 0022223114 4
data
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
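Since the question also mentions matching different combinations at different positions, the same vectorized idea generalizes to a small helper (a sketch; match_at and its arguments are invented names, not an existing function):
# keep rows whose digits at positions `from` through `to` equal `value`
match_at <- function(data, from, to, value) {
  subset(data, substring(Combinations, from, to) == value)
}
match_at(df, 7, 9, "311") # same result as above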
Another base R idea:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
df[grep(pattern = "^[0-9]{6}311.$", df$Combinations), ]
Output:
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
As a tip, if you want to know more about regular expressions, this website helps me a lot: https://regexr.com/3elkd
Would this work?
library(dplyr)
library(stringr)
df %>% filter(str_sub(Combinations, 7,9) == 311)
Combinations Values
1 0011113111 2
2 0013113112 3
3 0022223114 4
Not pretty but works:
df[which(lapply(strsplit(df$Combinations, ""), function(x) which(x[7]==3 & x[8]==1 & x[9]==1))==1),]
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
Data:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)

How to check if pairs from df2 are in pairs of df1 (inclusive) in R?

I have two dataframes, and I want to compare the pairs of dataframe b to the pairs of dataframe a, to see if the pairs from b fall within (inclusive) the pairs/range of those in a. For instance, see below:
df_1 <- data.frame(x = c(-82.38319, -82.38318, -82.40397, -82.40417, -82.40423),
                   y = c(29.61212, 29.61125, 29.61130, 29.61134, 29.61167))
#Output:
# x y
# 1 -82.38319 29.61212
# 2 -82.38318 29.61125
# 3 -82.40397 29.61130
# 4 -82.40417 29.61134
# 5 -82.40423 29.61167
df_2 <- data.frame(o = c(-82.38320, -82.38317, -82.40397, -82.40416, -82.40424),
                   t = c(29.61212, 29.6114, 29.61130, 29.61133, 29.61167))
#Output:
# o t
# 1 -82.38320 29.61212
# 2 -82.38317 29.61140
# 3 -82.40397 29.61130
# 4 -82.40416 29.61133
# 5 -82.40424 29.61167
#made this dataframe as an example only.
desired_output <- data.frame(lat= df_2$o, lon= df_2$t, exists= c(NA, "YES","YES","YES",NA))
#Output I seek:
# lat lon exists
# 1 -82.38320 29.61212 <NA>
# 2 -82.38317 29.61140 YES
# 3 -82.40397 29.61130 YES
# 4 -82.40416 29.61133 YES
# 5 -82.40424 29.61167 <NA>
#explanation:
#1- even though -82.38320 is OK & is within rows 3,4,5 of df_1, 29.61212 is out of bounds with their co-pairings.
#2- row 2 of df_2 is within the row 5 of df_1.
#3- row 3 of df_2 matches to row 3 of df_1 thus inclusive
#4- row 4 pair matches and its co_pair is less than those pair of row 4 in df_1
#5- This pair at row 5 is out of bounds in all of the rows of df_1
#Column "exists" can be appended to dataframe b, result matters only, neatness is not an issue.
I have done digging around in Stack Overflow, got nothing but this listing. But this person was comparing a single value with pairs, not pairs to pairs or pairs within pairs. I tried cbind on both dataframes and comparing that way, but it failed.
What can I try next?
We can use mapply to compare the o and t values of df_2 with df_1 and check if any value is in the range, assigning "YES" or NA accordingly.
df_2$exists <- c(NA, "YES")[mapply(function(x, y)
  any(df_1$x <= x & df_1$y >= y), df_2$o, df_2$t) + 1]
df_2
# o t exists
#1 -82.38320 29.61212 <NA>
#2 -82.38317 29.61140 YES
#3 -82.40397 29.61130 YES
#4 -82.40416 29.61133 YES
#5 -82.40424 29.61167 <NA>
We can use a non-equi join in data.table
library(data.table)
setDT(df_2)[df_1, exists := "YES", on = .(o >= x, t < y), mult = 'first']
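As with the mapply approach, rows of df_2 with no qualifying row in df_1 are simply left as NA in exists, matching the desired output.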

How to sum every nth (200) observation in a data frame using R [duplicate]

This question already has answers here:
calculating mean for every n values from a vector
(3 answers)
I am new to R so any help is greatly appreciated!
I have a data frame of 278,800 observations for each of my 10 variables, and I am trying to create an 11th variable that sums every 200 observations (rows) of a specific variable/column (sum of rows 1:200, 201:400, 401:600, etc.), similar to the OFFSET function in Excel.
I have tried subsetting my data to just the variable of interest, with the aim of adding a new variable that sums every 200 rows, however I cannot figure it out. I understand my new "variable" will produce 1,394 data points (278,800/200). I have tried the rollapply function, but the output does not sum in blocks of 200; it sums 1:200, 2:201, 3:202, etc.
Thanks,
E
rollapply has a by= argument for that. Here is a smaller example using n = 3 instead of n = 200. Note that 1+2+3=6, 4+5+6=15, 7+8+9=24 and 10+11+12=33.
# test data
DF <- data.frame(x = 1:12)
library(zoo)
n <- 3
rollapply(DF$x, n, sum, by = n)
## [1] 6 15 24 33
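An equivalent base R trick, assuming the vector's length is an exact multiple of n (as 278,800 is for 200), is to reshape the vector into an n-row matrix and sum its columns:
# each column of the matrix holds one block of n consecutive values
colSums(matrix(DF$x, nrow = n))
## [1]  6 15 24 33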
First let's generate some data and get a label for each group:
library(tidyverse)
df <-
  rnorm(1000) %>%
  as_tibble() %>%
  mutate(grp = floor(1 + (row_number() - 1) / 200))
> df
# A tibble: 1,000 x 2
       value   grp
       <dbl> <dbl>
    1 -1.06      1
    2  0.668     1
    3 -2.02      1
    4  1.21      1
  ...
 1000  0.78      5
This draws 1000 random N(0,1) values, turns them into a data frame, and then adds an incrementing numeric label for each group of 200.
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value))
# A tibble: 5 x 2
    grp grp_sum
  <dbl>   <dbl>
1     1    9.63
2     2  -12.8
3     3  -18.8
4     4   -8.93
5     5  -25.9
Then we just need to do a group-by operation on the second column and sum the values. You can use the pull() operation to get a vector of the results:
df %>%
group_by(grp) %>%
summarize(grp_sum = sum(value)) %>%
pull(grp_sum)
[1] 9.62529 -12.75193 -18.81967 -8.93466 -25.90523
I created a vector with 278800 observations (a)
a <- rnorm(278800)
b <- numeric(length(a) / 200) # preallocate the column of interest
j <- 1
for (i in seq(1, length(a), by = 200)) {
  b[j] <- sum(a[i:(i + 199)]) # note i:(i + 199); i:i+199 parses as (i:i)+199
  j <- j + 1
}
View(b)

Sorting a column in descending order in R excluding the first row

I have a dataframe with 5 columns and a very large dataset. I want to sort by column 3. How do you sort everything after the first row? (When calling this function I want to end it with nrows)
Example output:
Original:
4
7
9
6
8
New:
4
9
8
7
6
Thanks!
If I'm correctly understanding what you want to do, this approach should work:
z <- data.frame(x1 = seq(10), x2 = rep(c(2,3), 5), x3 = seq(14, 23))
zsub <- z[2:nrow(z),]
zsub <- zsub[order(-zsub[,3]),]
znew <- rbind(z[1,], zsub)
Basically, snip off the rows you want to sort, sort them in descending order on column 3, then reattach the first row.
And here's a piped version using dplyr, so you don't clutter the workspace with extra objects:
library(dplyr)
z <- z %>%
  slice(2:nrow(z)) %>%
  arrange(-x3) %>%
  rbind(slice(z, 1), .)
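Note the explicit "." in the final rbind(): when "." appears as an argument, magrittr does not also insert the piped value as the first argument, so the sorted rows land after the preserved first row.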
You might try this single line of code to modify the third column in your data frame df as described:
df[,3] <- c(df[1,3], sort(df[-1,3], decreasing = TRUE))
# reorder all but the first element of column x in decreasing order
df$x[-1] <- df$x[-1][order(df$x[-1], decreasing = TRUE)]
# x
# 1 4
# 2 9
# 3 8
# 4 7
# 5 6
