How to replace NULL/? with 'None' or '0' in R

DF1 is:
ID CompareID Distance
 1       256        0
 1       834        0
 1       946        0
 2       629        0
 2       735        1
 2       108        1
The expected output is DF2, as below. (Condition for generating DF2: in DF1, for any ID, if Distance == 1, put the corresponding CompareID into the SimilarID column; if Distance == 0, ignore that CompareID.)
ID SimilarID
1 None
2 735,108
The comparison is done correctly, but I got the output below:
ID SimilarID
1 ?
2 735,108
I understand that the '?' is displayed because there is no CompareID to put in SimilarID. I want to replace this '?' with 'None' or '0'. In some cases I see a 'NULL' value instead of '?'. Kindly help.
Thanks!
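If you already have a result frame and just want to fix it up after the fact, a minimal post-processing sketch (assuming your result is a data frame named df2 whose SimilarID column holds NA, which some front ends display as '?' or 'NULL', for IDs with no matches):

```r
# Hypothetical result frame with an NA where no CompareID matched
df2 <- data.frame(ID = c(1, 2), SimilarID = c(NA, "735,108"))

# Replace the missing values with the literal string "None"
df2$SimilarID[is.na(df2$SimilarID)] <- "None"
df2
#   ID SimilarID
# 1  1      None
# 2  2   735,108
```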

Using the data.table package, where df is your original data ...
library(data.table)
setDT(df)[, .(SimilarID = if (all(Distance == 0)) "None"
              else toString(CompareID[Distance == 1])), by = ID]
# ID SimilarID
# 1: 1 None
# 2: 2 735, 108
This follows your expected output by returning, by ID:
- "None" when all of the Distance column is zero
- the CompareID values where Distance is 1, as a comma-delimited string
Data:
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), CompareID = c(256L,
834L, 946L, 629L, 735L, 108L), Distance = c(0L, 0L, 0L, 0L, 1L,
1L)), .Names = c("ID", "CompareID", "Distance"), class = "data.frame", row.names = c(NA,
-6L))

Try the following using dplyr:
summarise.func <- function(Distance, CompareID) {
  SimilarID <- CompareID[Distance == 1]
  if (length(SimilarID) == 0) "None" else paste0(SimilarID, collapse = ",")
}
library(dplyr)
df2 <- df1 %>% group_by(ID) %>%
summarise(SimilarID=summarise.func(Distance,CompareID))
First, define a summarizing function summarise.func that:
- extracts the CompareID values with Distance == 1 into a SimilarID vector;
- if this SimilarID vector has elements, returns these CompareIDs collapsed with ","; otherwise returns "None".
Then use summarise.func to summarise SimilarID grouped by ID.
Using your data:
print(df2)
## # A tibble: 2 x 2
##      ID SimilarID
##   <int> <chr>
## 1     1 None
## 2     2 735,108

Using aggregate in base R:
df2 <- aggregate((CompareID * Distance) ~ ID, df, FUN = function(x)
  ifelse(sum(x) > 0, paste(x[x > 0], collapse = ","), "None"))
names(df2) <- c("ID", "SimilarID") #if necessary
# ID SimilarID
#1 1 None
#2 2 735,108
CompareID*Distance ensures that CompareID is ignored when Distance == 0. Then, grouped by ID, if the sum is greater than 0 the non-zero values (x[x > 0]) are comma-delimited; otherwise "None" is returned.


cumsum with a condition to restart in R

I have a dataset containing multiple columns. I want to use cumsum() on a column, conditioning the sum on another column. That is, when X happens I want the sum to restart from zero, but I also want the row of the "x" event itself to be included in the sum. Here is an example:
inv ass port G cumsum(G)
i   x   2    1 1
i   x   2    0 1
i   x   0    1 2
i   x   3    0 0
i   x   3    1 1
So in the 3rd row the condition port == 0 happens. I want cumsum(G) to still include the value of G in the 3rd row and restart the count from the following row.
I'm using dplyr to group_by(investor, asset), but I'm stuck, since I'm doing:
res1 <- res %>%
group_by(investor, asset) %>%
mutate(posdays = ifelse(operation < 0 & portfolio == 0, 0, cumsum(G))) %>%
ungroup()
This restarts the cumsum() but excludes the 3rd row from the sum.
I think I need something saying "cumsum(G), but when condition "x" occurred in the previous row, restart the sum".
Can you help me?
You may use cumsum to create groups as well.
library(dplyr)
df <- df %>%
  group_by(group = cumsum(dplyr::lag(port == 0, default = FALSE))) %>%
  mutate(cumsum_G = cumsum(G)) %>%
  ungroup()
df
# inv ass port G group cumsum_G
# <chr> <chr> <int> <int> <dbl> <int>
#1 i x 2 1 0 1
#2 i x 2 0 0 1
#3 i x 0 1 0 2
#4 i x 3 0 1 0
#5 i x 3 1 1 1
You may remove the group column from output using %>% select(-group).
data
df <- structure(list(inv = c("i", "i", "i", "i", "i"), ass = c("x",
"x", "x", "x", "x"), port = c(2L, 2L, 0L, 3L, 3L), G = c(1L,
0L, 1L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L))
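The same lag-then-cumsum idea works without dplyr. A base R sketch, using the df defined above: lagging the restart condition by one row keeps the port == 0 row in its old group, and ave() then applies cumsum within each group.

```r
df <- data.frame(inv = "i", ass = "x",
                 port = c(2L, 2L, 0L, 3L, 3L),
                 G = c(1L, 0L, 1L, 0L, 1L))

# Lag the restart condition by one row so the row where port == 0
# still counts toward the old group
grp <- cumsum(c(0, head(df$port == 0, -1)))

# Cumulative sum of G within each group
df$cumsum_G <- ave(df$G, grp, FUN = cumsum)
df$cumsum_G
# [1] 1 1 2 0 1
```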

Getting column-wise means and standard deviations for positive and negative values separately in R

I can get column-wise means and standard deviations (sample) of a dataframe as follows:
means <- apply(df , 2, mean)
sdevs <- apply(df , 2, sd)
However, my dataframe contains positive and negative values, and I need to get means and standard deviations for the positive and negative values separately.
Example Input:
COL1 COL2
1 1
2 1
3 1
-1 -1
-5 -1
-9 -1
Example Output:
positive_means = [2,1]
positive_sdevs = [1,0]
negative_means = [-5,-1]
negative_sdevs = [4,0]
I do not want to build a for loop because my data frame contains too many values and columns.
Thanks.
You can try this: create a group for positive and negative values, then summarise with dplyr functions:
library(dplyr)
#Code
new <- df %>%
  mutate(group = ifelse(COL1 > 0 & COL2 > 0, 'Pos', 'Neg')) %>%
  group_by(group) %>%
  summarise_all(list('mean' = mean, 'sd' = sd))
Output:
# A tibble: 2 x 5
group COL1_mean COL2_mean COL1_sd COL2_sd
<chr> <dbl> <dbl> <dbl> <dbl>
1 Neg -5 -1 4 0
2 Pos 2 1 1 0
Some data used:
#Data
df <- structure(list(COL1 = c(1L, 2L, 3L, -1L, -5L, -9L), COL2 = c(1L,
1L, 1L, -1L, -1L, -1L)), class = "data.frame", row.names = c(NA,
-6L))
Another option is to use apply() and rowSums():
#Code1
as.data.frame(apply(df[rowSums(df)>0,],2,function(x) {data.frame(Mean=mean(x),SD=sd(x))}))
Output:
COL1.Mean COL1.SD COL2.Mean COL2.SD
1 2 1 1 0
#Code2
as.data.frame(apply(df[!rowSums(df)>0,],2,function(x) {data.frame(Mean=mean(x),SD=sd(x))}))
Output:
COL1.Mean COL1.SD COL2.Mean COL2.SD
1 -5 4 -1 0
Here's another base R option to add to Duck's excellent answer:
as.data.frame(lapply(df, function(x) c(mean_pos = mean(x[x > 0]),
mean_neg = mean(x[x <= 0]),
sd_pos = sd(x[x > 0 ]),
sd_neg = sd(x[x <= 0]))))
#> COL1 COL2
#> mean_pos 2 1
#> mean_neg -5 -1
#> sd_pos 1 0
#> sd_neg 4 0
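A base R aggregate() variant of the same idea may also be worth noting. This is a sketch under the same assumption the dplyr answer makes, namely that each row is entirely positive or entirely negative, so grouping on the sign of COL1 is enough:

```r
df <- data.frame(COL1 = c(1, 2, 3, -1, -5, -9),
                 COL2 = c(1, 1, 1, -1, -1, -1))

# Label each row by the sign of COL1, then aggregate column-wise
grp <- ifelse(df$COL1 > 0, "Pos", "Neg")
means <- aggregate(df, by = list(group = grp), FUN = mean)
sds   <- aggregate(df, by = list(group = grp), FUN = sd)
means
#   group COL1 COL2
# 1   Neg   -5   -1
# 2   Pos    2    1
sds
#   group COL1 COL2
# 1   Neg    4    0
# 2   Pos    1    0
```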

Using a vector as a grep pattern

I am new to R. I am trying to search the columns using grep multiple times within an apply loop. I use grep to specify which columns are summed, based on the vector individuals:
individuals <-c("ID1","ID2".....n)
bcdata_total <- sapply(individuals, function(x) {
apply(bcdata_clean[,grep(individuals, colnames(bcdata_clean))], 1, sum)
})
bcdata is of random size and contains random data, but its columns contain the individuals strings as part of their names:
>head(bcdata)
ID1-4 ID1-3 ID2-5
A 3 2 1
B 2 2 3
C 4 5 5
grep(individuals[1], colnames(bcdata_clean)) returns a vector like
[1] 1 2, the positions of the columns whose names contain ID1. That vector is used to select the columns to be summed in bcdata_clean. This should occur n times, depending on the length of individuals.
However, this returns the warning
In grep(individuals, colnames(bcdata)) :
argument 'pattern' has length > 1 and only the first element will be used
and results in all the columns of bcdata_total being identical.
Ideally, individuals would increment on each iteration of the function, like this:
apply(bcdata_clean[,grep(individuals[1,2....n], colnames(bcdata_clean))], 1, sum)
and would result in something like this
>head(bcdata_total)
ID1 ID2
A 5 1
B 4 3
C 9 5
But I'm not sure how to increment individuals. What is the best way to do this within the function?
You can use split.default to split data on similarly named columns and sum them row-wise.
sapply(split.default(df, sub('-.*', '', names(df))), rowSums, na.rm = TRUE)
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))
Passing individuals as the function argument (instead of x) fixed my issue:
bcdata_total <- sapply(individuals, function(individuals) {
apply(bcdata_clean[,grep(individuals, colnames(bcdata_clean))], 1, sum)
})
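One caveat with this approach (a hedged note, not part of the original answers): grep treats each element of individuals as a regular expression, so "ID1" would also match a column like "ID10-2". Anchoring the pattern avoids that, assuming column names follow the "<ID>-<suffix>" convention shown in the question:

```r
cols <- c("ID1-4", "ID1-3", "ID10-2", "ID2-5")
grep("ID1", cols)    # positions 1, 2 and 3: "ID10-2" is caught unintentionally
grep("^ID1-", cols)  # positions 1 and 2 only

# Anchored version of the summing loop, on the question's example data
bcdata_clean <- data.frame(`ID1-4` = c(3, 2, 4), `ID1-3` = c(2, 2, 5),
                           `ID2-5` = c(1, 3, 5), check.names = FALSE,
                           row.names = c("A", "B", "C"))
individuals <- c("ID1", "ID2")
bcdata_total <- sapply(individuals, function(id) {
  idx <- grep(paste0("^", id, "-"), colnames(bcdata_clean))
  # drop = FALSE keeps a data frame even when only one column matches
  rowSums(bcdata_clean[, idx, drop = FALSE])
})
bcdata_total
#   ID1 ID2
# A   5   1
# B   4   3
# C   9   5
```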
An option with tidyverse
library(dplyr)
library(tidyr)
library(tibble)
df %>%
rownames_to_column('rn') %>%
pivot_longer(cols = -rn, names_to = c(".value", "grp"), names_sep="-") %>%
group_by(rn) %>%
summarise(across(starts_with('ID'), sum, na.rm = TRUE), .groups = 'drop') %>%
column_to_rownames('rn')
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))

Looping through columns and duplicating data in R

I am trying to iterate through the columns; if a column is a whole year, it should be duplicated four times and renamed to quarters.
So this
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to look for column names with 4 digits, then take the last two digits from those columns:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  gather(key, value) %>%
  group_by(key) %>%
  mutate(key_new = ifelse(str_detect(key, '\\d{4}'),
                          paste0('Q', 1:4, '-', str_extract(key, '\\d{2}$'),
                                 collapse = ','),
                          key)) %>%
  ungroup() %>%
  select(-key) %>%
  separate_rows(key_new, sep = ',') %>%
  spread(key_new, value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
  if (i %in% col.ids)
    return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
  return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3

Removing all columns summing to zero with dplyr

I'm currently working on a dataframe that looks something like this:
Site Spp1 Spp2 Spp3 LOC TYPE
S01 2 4 0 A FLOOD
S02 4 0 0 A REG
....
S10 0 1 0 B FLOOD
S11 1 0 0 B REG
What I'm trying to do is subset the dataframe so I can run some indicator species analysis in R.
The following code works in that I create two subsets of the data, merge them into one frame and then drop the unused factor levels
A.flood <- filter(data, TYPE == "FLOOD", LOC == "A")
B.flood <- filter(data, TYPE == "FLOOD", LOC == "B")
A.B.flood <- rbind(A.flood, B.flood) %>% droplevels.data.frame(A.B.flood, except = c("A", "B"))
What I also need to do is drop all Spp columns (in my real dataset there are ~60) that sum to zero. Is there a way to achieve this with dplyr, and if there is, is it possible to pipe that code onto the existing A.B.flood dataframe code?
Thanks!
EDIT
I managed to remove all the columns that summed to zero by selecting only the columns that summed to more than zero:
A.B.flood.subset <- A.B.flood[, apply(A.B.flood[1:(ncol(A.B.flood))], 2, sum)!=0]
I realize this question is now quite old, but I came across it and found another solution using dplyr's select and which, which might seem clearer to dplyr enthusiasts:
A.B.flood.subset <- A.B.flood %>% select(which(!colSums(A.B.flood, na.rm=TRUE) %in% 0))
Without using any package, we can use rowSums on the 'Spp' columns (subset the columns using grep) and double-negate, so that rows with sum > 0 become TRUE and the others FALSE; use this index to subset the rows.
data[!!rowSums(data[grep('Spp', names(data))]),]
Or, using dplyr/magrittr: select the 'Spp' columns, get the sum of each row with Reduce, double-negate, and use extract from magrittr to subset the original dataset with the derived index.
library(dplyr)
library(magrittr)
data %>%
select(matches('^Spp')) %>%
Reduce(`+`, .) %>%
`!` %>%
`!` %>%
extract(data,.,)
data
data <- structure(list(Site = c("S01", "S02", "S03", "S04"),
                       Spp1 = c(2L, 4L, 0L, 4L),
                       Spp2 = c(4L, 0L, 0L, 0L),
                       Spp3 = c(0L, 0L, 0L, 0L),
                       LOC = c("A", "A", "A", "A"),
                       TYPE = c("FLOOD", "REG", "FLOOD", "REG")),
                  .Names = c("Site", "Spp1", "Spp2", "Spp3", "LOC", "TYPE"),
                  class = "data.frame", row.names = c(NA, -4L))
You should convert to tidy data with tidyr::gather() and the data frame will be much easier to manipulate.
library(tidyr)
library(dplyr)
A.B.Flood %>% gather(Species, Sp.Count, -Site, -LOC, -TYPE) %>%
group_by(Species) %>%
filter(Sp.Count > 0)
Voila, your tidy data minus the zero counts.
# Site LOC TYPE Species Sp.Count
# <fctr> <fctr> <fctr> <chr> <int>
#1 S01 A FLOOD Spp1 2
#2 S02 A REG Spp1 4
#3 S11 B REG Spp1 1
#4 S01 A FLOOD Spp2 4
#5 S10 B FLOOD Spp2 1
Personally I'd keep it like this. If you want your original format back with the zero counts for the non-discarded species, just add %>% spread(Species, Sp.Count, fill = 0) to the pipeline.
# Site LOC TYPE Spp1 Spp2
#* <fctr> <fctr> <fctr> <dbl> <dbl>
#1 S01 A FLOOD 2 4
#2 S02 A REG 4 0
#3 S10 B FLOOD 0 1
#4 S11 B REG 1 0
There is an even easier and quicker way to do this (and also more in line with your question: using dplyr).
A.B.flood.subset <- A.B.flood[, colSums(A.B.flood != 0) > 0]
or with a MWE:
df <- data.frame (x = rnorm(100), y = rnorm(100), z = rep(0, 100))
df[, colSums(df != 0) > 0]
For those who want to use dplyr 1.0.0 with the where keyword, you can do:
A.B.flood %>%
select(where( ~ is.numeric(.x) && sum(.x) != 0))
returns:
Spp1 Spp2
1 2 4
2 4 0
3 0 0
4 4 0
using the same data given by #akrun:
A.B.flood <- structure(
list(
Site = c("S01", "S02", "S03", "S04"),
Spp1 = c(2L,
4L, 0L, 4L),
Spp2 = c(4L, 0L, 0L, 0L),
Spp3 = c(0L, 0L, 0L, 0L),
LOC = c("A", "A", "A", "A"),
TYPE = c("FLOOD", "REG",
"FLOOD",
"REG")
),
.Names = c("Site", "Spp1", "Spp2", "Spp3", "LOC",
"TYPE"), class = "data.frame", row.names = c(NA, -4L))
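Note that this drops every non-numeric column (Site, LOC, TYPE) along with Spp3. If you want to keep those, a small variant of the same where() predicate works; a sketch under the same dplyr >= 1.0.0 assumption, with the data rebuilt inline:

```r
library(dplyr)

A.B.flood <- data.frame(Site = c("S01", "S02", "S03", "S04"),
                        Spp1 = c(2L, 4L, 0L, 4L),
                        Spp2 = c(4L, 0L, 0L, 0L),
                        Spp3 = c(0L, 0L, 0L, 0L),
                        LOC = "A",
                        TYPE = c("FLOOD", "REG", "FLOOD", "REG"))

# Keep every column that is either non-numeric or has a non-zero sum
res <- A.B.flood %>%
  select(where(~ !is.numeric(.x) || sum(.x) != 0))
names(res)
# [1] "Site" "Spp1" "Spp2" "LOC"  "TYPE"
```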
