This is a simple problem but I have not found an explicit solution in the archives. Say I have a matrix m:
m <- structure(c(2, 0, 1, 1, 0, 2, 2, 2, 2, 1, 0, 2, 2, 2, 1, 2, 0,
1, 0, 1, 0, 2, 2, 0, 1, 2, 2, 1, 2, 0, 2, 0, 1, 0, 2, 1, 2, 1,
0, 1, 0, 2, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 2, 2, 1, 1, 1,
0, 2, 2, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 2, 0, 26, 18, 26, 18,
22, 21, 13, 22, 27, 20, 27, 24, 18, 21, 18, 22, 16, 22, 19, 15,
22, 27, 20, 20, 17), .Dim = c(25L, 4L), .Dimnames = list(NULL,
c("r", "s", "t", "u")))
I want to take the subset of the matrix whose rows have one of the following values in column u:
vec <- c(20, 21, 22, 24, 26)
In other words, select the rows containing those values. Any suggestions on how to do that, or a link to a solution?
You could use which() together with %in%, but %in% alone is enough, because a logical vector can index rows directly (many thanks and credit to @GKi):
#Code
newmat <- m[m[,'u'] %in% vec,]
Output:
      r s t  u
 [1,] 2 2 0 26
 [2,] 1 1 0 26
 [3,] 0 0 2 22
 [4,] 2 2 2 21
 [5,] 2 1 1 22
 [6,] 1 2 0 20
 [7,] 2 2 2 24
 [8,] 2 0 1 21
 [9,] 2 0 1 22
[10,] 1 0 1 22
[11,] 0 1 0 22
[12,] 2 0 0 20
[13,] 0 0 2 20
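A minimal sketch of the two forms mentioned above, using a small made-up matrix (`small` and `wanted` are hypothetical names, not from the question). Both forms should select the same rows; `drop = FALSE` is an optional extra that keeps the result a matrix even when only one row matches.

```r
# Hypothetical toy matrix with the same column names as in the question
small <- matrix(c(1, 2, 0,
                  0, 1, 2,
                  2, 0, 1,
                  20, 18, 26),
                ncol = 4,
                dimnames = list(NULL, c("r", "s", "t", "u")))
wanted <- c(20, 26)

# Logical indexing: %in% returns one TRUE/FALSE per row
by_logical <- small[small[, "u"] %in% wanted, , drop = FALSE]

# Integer indexing: which() turns the logical vector into row numbers
by_position <- small[which(small[, "u"] %in% wanted), , drop = FALSE]

# With a single match, drop = FALSE prevents collapsing to a plain vector
one_row <- small[small[, "u"] %in% 26, , drop = FALSE]
```

The which() form mainly matters when the test can produce NA values, since which() silently drops them.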
Please see my code below:
# functions to get percentile threshold, and assign new values to outliers
get_low_perc <- function(var_name) {
return(quantile(var_name, c(0.01)))
}
get_hi_perc <- function(var_name) {
return(quantile(var_name, c(0.99)))
}
round_up <- function(target_var, flag_var, floor) {
target_var <- as.numeric(ifelse(flag_var == 1, floor, target_var))
return(as.integer(target_var))
}
round_down <- function(target_var, flag_var, ceiling) {
target_var <- as.numeric(ifelse(flag_var == 1, ceiling, target_var))
return(as.integer(target_var))
}
# try putting it all together
no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
lo_perc <- get_low_perc(df_col_name)
hi_perc <- get_hi_perc(df_col_name)
df$df_col_flagH <- as.factor(ifelse(df_col_name < lo_perc, 1, 0))
df$df_col_flagL <- as.factor(ifelse(df_col_name > hi_perc, 1, 0))
df_col_name <- round_up(df_col_name, df_col_flagL, lo_perc)
df_col_name <- round_down(df_col_name, df_col_flagH, hi_perc)
# names(df)[names(df)=='df_col_flagH'] <-
# boxplot(df_col_name)
return(df)
}
I have created five custom functions. The first two return the 1st and 99th percentiles of a given variable. The next two replace flagged values in that variable with the 1st or 99th percentile value. The last function tries to put these together and return a new data frame containing the original columns of df, the updated column, and two new columns flagging the values below the 1st percentile and above the 99th percentile. I have produced a mock data frame below, since I can't share my actual data here.
df2 = data.frame(col = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77),
col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102))
Ideally, after I call the function using the command "no_way(df2, df2$col1, df2$new_col1, df2$new_col2)", I want an output dataframe looking like:
df2 = data.frame(col = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77), # updated with appropriate values
col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102),
new_col1 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
new_col2 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0))
^ Where new_col1 and new_col2 are column names given by the user when calling the function. I am currently getting the dataframe as expected, but the new columns created have kept the function parameters' names, as in:
df2 = data.frame(col = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77), # updated with appropriate values
col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102),
df_col_flagH = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
df_col_flagL = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0))
I would not mind changing the names of the columns afterwards, but I will be using this function on 17 columns, so that wouldn't be optimal. Please help.
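You should pass the column names as strings and index the data frame with [[ ]]. df$df_col_flagH always looks for a column literally named df_col_flagH, which is why the parameter names ended up in your output.
Also, ifelse(condition, 1, 0) can be simplified to as.integer(condition).
The flag columns have to be passed to round_up() and round_down() as df[[...]] as well (not as bare names), and each flag has to be paired with its own threshold: values below the 1st percentile are raised to it, and values above the 99th percentile are lowered to it.
no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
  lo_perc <- get_low_perc(df[[df_col_name]])
  hi_perc <- get_hi_perc(df[[df_col_name]])
  # As in the question, the first flag marks values below the 1st percentile
  # and the second flag marks values above the 99th percentile
  df[[df_col_flagH]] <- as.factor(as.integer(df[[df_col_name]] < lo_perc))
  df[[df_col_flagL]] <- as.factor(as.integer(df[[df_col_name]] > hi_perc))
  # Pass the flag columns themselves, not their names
  df[[df_col_name]] <- round_up(df[[df_col_name]], df[[df_col_flagH]], lo_perc)
  df[[df_col_name]] <- round_down(df[[df_col_name]], df[[df_col_flagL]], hi_perc)
  return(df)
}
df2 <- no_way(df2, "col1", "new_col1", "new_col2")
df2
#    col col1 col2 new_col1 new_col2
# 1    1    9    8        0        0
# 2    3    6    2        0        0
# 3    4    8    8        0        0
# 4    5    3    4        1        0
# 5    8    4   87        0        0
# 6    7    5   66        0        0
# 7   67    8   54        0        0
# 8  744    7   99        0        0
# 9  876   67   77        0        0
# 10   8  665   77        0        1
# 11   8   87   88        0        0
# 12  54   33   67        0        0
# 13   9   77  102        0        0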
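Since the question mentions running this over 17 columns, here is a hedged sketch of how string column names make that loop straightforward. The helper `clamp_and_flag()` and the `_low`/`_high` suffixes are hypothetical and not part of the answer above; it clamps with pmin()/pmax() instead of the question's round_up()/round_down(), so values are not truncated to integers.

```r
# Hypothetical helper: flag values outside the 1st/99th percentiles and
# clamp them to those percentiles. All column names are passed as strings.
clamp_and_flag <- function(df, col, low_name, high_name) {
  x <- df[[col]]
  lo <- quantile(x, 0.01, names = FALSE)
  hi <- quantile(x, 0.99, names = FALSE)
  df[[low_name]] <- as.integer(x < lo)    # 1 where below the 1st percentile
  df[[high_name]] <- as.integer(x > hi)   # 1 where above the 99th percentile
  df[[col]] <- pmin(pmax(x, lo), hi)      # clamp into [lo, hi]
  df
}

df2 <- data.frame(col  = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
                  col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77))

# Build the flag column names from each target column name
for (nm in c("col", "col1")) {
  df2 <- clamp_and_flag(df2, nm, paste0(nm, "_low"), paste0(nm, "_high"))
}
```

Replacing c("col", "col1") with a vector of the 17 column names would apply the same treatment to each of them.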
I have two data sets of different lengths. I want to create a new column in the data set with more rows, based on filtering a specific column from the shorter df. I am getting the warning "longer object length is not a multiple of shorter object length", and the result is also incorrect. I created smaller example data sets and tried the same code on them, and it works with correct results. I am not sure why the results are incorrect on my original data and why I am getting the warning.
The example datasets are
structure(list(id = 1:10, activity = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 0), code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)), class = "data.frame", row.names = c(NA,
-10L))
the second df
structure(list(id2 = 1:20, code2 = c(2, 5, 11, 15, 9, 18, 21,
3, 27, 55, 2, 5, 11, 15, 3, 18, 21, 3, 27, 55), d_Activity = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-20L))
I tried this on both my original data sets, where I get the warning, and on these dummy dfs, where there is no warning and the results are correct.
data2 <- data2 %>%
mutate(d_Activity = ifelse(code2 %in% data1$code & activity == 1, 1,0))
Actually, you are going about it the wrong way. Let me explain.
It works on the sample data only because the larger df has 20 rows, which is a multiple of the 10 rows in the smaller df, so R silently recycles the shorter vector.
With your syntax you are combining one complete vector with another complete vector (a column of the other df) element by element, because R operations are vectorised. When the lengths are not multiples of each other, R emits the warning, and in either case the elements being paired do not correspond to each other.
One way to match one value against many is purrr::map, where each individual value of the first argument (code2 here) is tested against another vector, here the codes in df1 with activity == 1, which is not itself an argument of map.
df1 <- structure(list(id = 1:10, activity = c(0, 0, 0, 0, 1, 0, 0, 1,
0, 0), code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55)), class = "data.frame", row.names = c(NA,
-10L))
df2 <- structure(list(id2 = 1:20, code2 = c(2, 5, 11, 15, 9, 18, 21,
3, 27, 55, 2, 5, 11, 15, 3, 18, 21, 3, 27, 55), d_Activity = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0)), class = "data.frame", row.names = c(NA,
-20L))
library(tidyverse)
df2 %>%
mutate(d_Activity = map(code2, ~ +(.x %in% df1$code[df1$activity == 1])))
#> id2 code2 d_Activity
#> 1 1 2 0
#> 2 2 5 0
#> 3 3 11 0
#> 4 4 15 0
#> 5 5 9 0
#> 6 6 18 0
#> 7 7 21 0
#> 8 8 3 1
#> 9 9 27 0
#> 10 10 55 0
#> 11 11 2 0
#> 12 12 5 0
#> 13 13 11 0
#> 14 14 15 0
#> 15 15 3 1
#> 16 16 18 0
#> 17 17 21 0
#> 18 18 3 1
#> 19 19 27 0
#> 20 20 55 0
Created on 2021-06-17 by the reprex package (v2.0.0)
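Because %in% is itself vectorised over its left-hand side, the same flag can probably be computed without map, as long as the filtering on activity happens inside df1 before the comparison. A minimal base R sketch on the example data (`active_codes` is a hypothetical name introduced here):

```r
df1 <- data.frame(id = 1:10,
                  activity = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0),
                  code = c(2, 5, 11, 15, 3, 18, 21, 3, 27, 55))
df2 <- data.frame(id2 = 1:20,
                  code2 = c(2, 5, 11, 15, 9, 18, 21, 3, 27, 55,
                            2, 5, 11, 15, 3, 18, 21, 3, 27, 55))

# Filter within df1 first, so both vectors in the subset have 10 elements
active_codes <- unique(df1$code[df1$activity == 1])

# %in% returns one TRUE/FALSE per element of code2, whatever the length
# of active_codes, so no recycling takes place
df2$d_Activity <- as.integer(df2$code2 %in% active_codes)
```

This gives an ordinary integer column, whereas map() returns a list column.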
I have two data frames:
> df1
2013-04-1 2013-04-2 2013-04-3 2013-04-4 2013-04-5 2013-04-6 2013-04-7 2013-04-8 2013-04-9 2013-04-10 2013-04-11
bin_1 32 489 32 32 364 19 312 0 0 0 346
bin_2 8 346 8 0 98 8 12 12 46 364 346
bin_3 9 98 346 46 9 312 6 1912 0 489 0
bin_4 4 12 9 12 0 12 0 987 9 19 12
bin_5 0 0 8 8 0 0 312 6 312 12 4
df1 contains 5 rows (bins) and 23 columns (dates).
> df2
orange apple pear banana watermelon lemon
2013-04-1 1 1 1 1 0 1
2013-04-2 1 1 0 1 0 0
2013-04-3 1 1 1 1 0 1
2013-04-4 0 1 0 1 1 1
2013-04-5 1 0 0 0 1 1
df2 contains 23 rows (dates) and 6 columns (types of fruit).
So now I want to condense these two dfs into one data frame that contains all the information, like:
> df3
orange apple pear banana watermelon lemon
bin_1 ? ? ? ? ? ?
bin_2 ? ? ? ? ? ?
bin_3 ? ? ? ? ? ?
bin_4 ? ? ? ? ? ?
bin_5 ? ? ? ? ? ?
But how can I condense the data? For example,
on 2013-04-1,
bin_1 contains 32 fruits, bin_2 contains 8 fruits, ..., bin_5 contains 0 fruits (based on df1)
only orange, apple, pear, banana, and lemon are available (based on df2)
Q. I want df3 to contain the condensed information, e.g. that bin_1 contains x oranges on average, and so on. How can I model this?
Code:
> dput(df1)
structure(list(`2013-04-1` = c(32, 8, 9, 4, 0), `2013-04-2` = c(489,
346, 98, 12, 0), `2013-04-3` = c(32, 8, 346, 9, 8), `2013-04-4` = c(32,
0, 46, 12, 8), `2013-04-5` = c(364, 98, 9, 0, 0), `2013-04-6` = c(19,
8, 312, 12, 0), `2013-04-7` = c(312, 12, 6, 0, 312), `2013-04-8` = c(0,
12, 1912, 987, 6), `2013-04-9` = c(0, 46, 0, 9, 312), `2013-04-10` = c(0,
364, 489, 19, 12), `2013-04-11` = c(346, 346, 0, 12, 4), `2013-04-12` = c(0,
9, 12, 46, 489), `2013-04-13` = c(32, 8, 19, 46, 0), `2013-04-14` = c(0,
987, 12, 0, 6), `2013-04-15` = c(0, 346, 4, 346, 0), `2013-04-16` = c(0,
1912, 1912, 12, 364), `2013-04-17` = c(12, 98, 32, 32, 1912),
`2013-04-18` = c(12, 12, 12, 0, 346), `2013-04-19` = c(9,
46, 98, 312, 4), `2013-04-20` = c(32, 987, 46, 9, 312), `2013-04-21` = c(4,
98, 12, 32, 12), `2013-04-22` = c(19, 0, 4, 346, 0), `2013-04-23` = c(1912,
364, 0, 0, 489)), row.names = c("bin_1", "bin_2", "bin_3",
"bin_4", "bin_5"), class = "data.frame")
> dput(df2)
structure(list(orange = c(1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0), apple = c(1, 1, 1, 1, 0, 1,
0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0), pear = c(1,
0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
0), banana = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1,
0, 0, 1, 1, 0, 1, 0), watermelon = c(0, 0, 0, 1, 1, 0, 1, 1,
1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0), lemon = c(1, 0,
1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0
)), row.names = c("2013-04-1", "2013-04-2", "2013-04-3", "2013-04-4",
"2013-04-5", "2013-04-6", "2013-04-7", "2013-04-8", "2013-04-9",
"2013-04-10", "2013-04-11", "2013-04-12", "2013-04-13", "2013-04-14",
"2013-04-15", "2013-04-16", "2013-04-17", "2013-04-18", "2013-04-19",
"2013-04-20", "2013-04-21", "2013-04-22", "2013-04-23"), class = "data.frame")
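One possible reading of "on average" is the mean bin count over the days on which a fruit is available. Under that assumption (which the question does not confirm), a matrix product followed by a division by the number of available days gives a bins-by-fruit table. The sketch below uses only the first three dates, two bins and two fruits from the data above; `totals`, `days_available` and `df3` as computed here are illustrative, not a definitive model.

```r
# Small subset of the question's data: 2 bins x 3 dates, 3 dates x 2 fruits
df1 <- data.frame(`2013-04-1` = c(32, 8),
                  `2013-04-2` = c(489, 346),
                  `2013-04-3` = c(32, 8),
                  row.names = c("bin_1", "bin_2"),
                  check.names = FALSE)
df2 <- data.frame(orange = c(1, 1, 1),
                  pear   = c(1, 0, 1),
                  row.names = c("2013-04-1", "2013-04-2", "2013-04-3"))

# The dates must line up: columns of df1 against rows of df2
stopifnot(identical(colnames(df1), rownames(df2)))

# Sum of each bin's counts over the days each fruit is available
totals <- as.matrix(df1) %*% as.matrix(df2)

# Number of days each fruit is available
days_available <- colSums(df2)

# Assumed definition of "average": total divided by available days
df3 <- as.data.frame(sweep(totals, 2, days_available, "/"))
```

Here pear is available on two of the three days, so bin_1's pear entry is (32 + 32) / 2.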
I'm trying to compare two matrices. When the values aren't equal, I want to use the value from mat2 as long as it is greater than 0; if it is zero, I want the value from mat1. As the code stands, it appears to always return the value of mat1.
Here is my attempt:
mat.data1 <- c(1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1)
mat1 <- matrix(data = mat.data1, nrow = 5, ncol = 5, byrow = TRUE)
mat.data2 <- c(0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 2, 0, 0, 0, 1, 2, 2, 0, 2, 1, 0, 1)
mat2 <- matrix(data = mat.data2, nrow = 5, ncol = 5, byrow = TRUE)
mat3 = if(mat1 == mat2){mat1} else {if(mat2>0){mat2} else {mat1}}
the expected output should be
1 0 1 1 1
0 1 2 1 1
1 1 2 2 0
1 1 1 2 2
1 2 1 0 1
Here is one potential way to do it.
mat.data1 <- c(1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1)
mat1 <- matrix(data = mat.data1, nrow = 5, ncol = 5, byrow = TRUE)
mat.data2 <- c(0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 2, 0, 0, 0, 1, 2, 2, 0, 2, 1, 0, 1)
mat2 <- matrix(data = mat.data2, nrow = 5, ncol = 5, byrow = TRUE)
mat3 <- mat1
to_change <- which(mat2 != mat1 & mat2 > 0)
mat3[to_change] <- mat2[to_change]
This use of which() asks for the positions where mat2 differs from mat1 AND mat2 is greater than zero. You can then use those positions to copy the corresponding values from mat2 into mat3.
The output is then:
> mat3
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 1 1 1
[2,] 0 1 2 1 1
[3,] 1 1 2 2 0
[4,] 1 1 1 2 2
[5,] 1 2 1 0 1
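For comparison, a hedged sketch of the same rule written with ifelse(), the element-wise counterpart of if(), which expects a single TRUE/FALSE rather than a whole matrix. Where the two matrices are equal it does not matter which value is taken, so the rule reduces to "take mat2 where it is positive, otherwise mat1".

```r
mat1 <- matrix(c(1, 0, 1, 1, 1,
                 0, 1, 1, 1, 1,
                 1, 1, 1, 0, 0,
                 1, 1, 0, 1, 0,
                 1, 1, 0, 0, 1),
               nrow = 5, ncol = 5, byrow = TRUE)
mat2 <- matrix(c(0, 0, 0, 0, 0,
                 0, 1, 2, 0, 0,
                 0, 1, 2, 2, 0,
                 0, 0, 1, 2, 2,
                 0, 2, 1, 0, 1),
               nrow = 5, ncol = 5, byrow = TRUE)

# ifelse() tests every element and keeps the dimensions of the test matrix
mat3 <- ifelse(mat2 > 0, mat2, mat1)
```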
We can use coalesce
library(dplyr)
out <- coalesce(replace(mat2, !mat2, NA), replace(mat1, !mat1, NA))
replace(out, is.na(out), 0)
Or, as @Axeman mentioned:
coalesce(out, 0)
I have a data frame set up like the one below (plot vs species occurrence data).
df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 7, 8, 9), speciesA = c(5, 0, 10, 0, 8, 45, 0, 0, 17), speciesB = c(0, 0, 0, 0, 0, 0, 0, 0, 0), speciesC = c(0.7, 0, 17, 0, 0, 8, 0, 9, 0), speciesD = c(1, 0, 0, 3, 0, 0, 0, 9, 1))
I need to create a second data frame (or subset this one) that contains only the species that occur in more than 4 plots. I used colSums to count the number of occurrences > 0 for each column, but cannot apply that to filter the data frame.
colSums(df != 0)
df2 <- df[,which(apply(df,2,colSums)> 4)]
Any suggestions?
How about this...
df2 <- df[,colSums(df>0)>4]
df2
plot speciesA
1 1 5
2 2 0
3 3 10
4 4 0
5 5 8
6 6 45
7 7 0
8 8 0
9 9 17
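Note that in the one-liner above the plot column is kept only because its own values happen to pass the same test. A hedged variant that counts occurrences over the species columns only and always keeps plot (`species_cols` and `keep` are hypothetical names introduced here):

```r
df <- data.frame(plot = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 speciesA = c(5, 0, 10, 0, 8, 45, 0, 0, 17),
                 speciesB = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
                 speciesC = c(0.7, 0, 17, 0, 0, 8, 0, 9, 0),
                 speciesD = c(1, 0, 0, 3, 0, 0, 0, 9, 1))

# Count occurrences over the species columns only, not the plot id
species_cols <- setdiff(names(df), "plot")
keep <- species_cols[colSums(df[species_cols] > 0) > 4]

# Always keep the plot id, plus the species found in more than 4 plots
df2 <- df[, c("plot", keep), drop = FALSE]
```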