I am trying to add two columns. My dataframe is like this one:
data <- data.frame(a = c(0,1,NA,0,NA,NA),
x = c(NA,NA,NA,NA,1,0),
t = c(NA,2,NA,NA,2,0))
I want to add some of the columns like this:
yep <- cbind.data.frame( data$a, data$x, rowSums(data[,c(1, 2)], na.rm = TRUE))
However the output looks like this:
> yep
data$a data$x rowSums(data[,c(1, 2)], na.rm = TRUE)
1 0 NA 0
2 1 NA 1
3 NA NA 0
4 0 NA 0
5 NA 1 1
6 NA 0 0
And I would like an output like this:
> yep
data$a data$x rowSums(data[,c(1, 2)], na.rm = TRUE)
1 0 NA 0
2 1 NA 1
3 NA NA NA
4 0 NA 0
5 NA 1 1
6 NA 0 0
If a row contains only NA values in the summed columns, I want the result to stay NA.
How could I achieve this?
Base R:
data <- data.frame("a" = c(0,1,NA,0,NA,NA),
"x" = c(NA,NA,NA,NA,1,0),
"t" = c(NA,2,NA,NA,2,0)
)
yep <- cbind.data.frame( data$a, data$x, rs = rowSums(data[,c(1, 2)], na.rm = TRUE))
yep$rs[is.na(data$a) & is.na(data$x)] <- NA
yep
Base R (ifelse):
cbind(data$a,data$x,ifelse(is.na(data$a) & is.na(data$x),NA,rowSums(data[,1:2],na.rm = TRUE)))
If you are looking for the column name then replace cbind with cbind.data.frame
Output:
[,1] [,2] [,3]
[1,] 0 NA 0
[2,] 1 NA 1
[3,] NA NA NA
[4,] 0 NA 0
[5,] NA 1 1
[6,] NA 0 0
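You might try dplyr::coalesce, which returns the first non-missing value in each row. It matches the desired output here only because no row has values in both a and x; it does not add the columns: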
cbind.data.frame( data$a, data$x, dplyr::coalesce(data$a, data$x))
# data$a data$x dplyr::coalesce(data$a, data$x)
#1 0 NA 0
#2 1 NA 1
#3 NA NA NA
#4 0 NA 0
#5 NA 1 1
#6 NA 0 0
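To make the difference between "first non-missing value" and "sum" concrete, here is a minimal base R sketch. The vectors a and x are made-up toy data (not the asker's), and ifelse(is.na(a), x, a) is used as a stand-in for what coalesce does with two vectors:

```r
# Toy vectors (assumption): row 2 and row 4 have values in BOTH columns
a <- c(0, 1, NA, 2)
x <- c(NA, 3, NA, 5)

# First non-missing value per row (what a two-argument coalesce returns)
first_non_na <- ifelse(is.na(a), x, a)

# NA-aware sum: NA only when both inputs are NA, otherwise sum ignoring NA
na_aware_sum <- ifelse(is.na(a) & is.na(x), NA,
                       rowSums(cbind(a, x), na.rm = TRUE))

first_non_na
na_aware_sum
```

The two results agree in rows where at most one column has a value and differ in rows where both do, so coalesce is only a shortcut when the columns never overlap.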
Base R with nested ifelse:
data[['rowsum']]<-ifelse(is.na(data$a) & is.na(data$x),NA,ifelse(is.na(data$a),0,data$a)+ifelse(is.na(data$x),0,data$x))
a x t rowsum
1: 0 NA NA 0
2: 1 NA 2 1
3: NA NA NA NA
4: 0 NA NA 0
5: NA 1 2 1
6: NA 0 0 0
Another base R approach.
If all the values in the rows are NA then return NA or else return sum of the row ignoring NA's.
#Select only the columns which we need
sub_df <- data[c("a", "x")]
sub_df$answer <- ifelse(rowSums(is.na(sub_df)) == ncol(sub_df), NA,
rowSums(sub_df, na.rm = TRUE))
sub_df
# a x answer
#1 0 NA 0
#2 1 NA 1
#3 NA NA NA
#4 0 NA 0
#5 NA 1 1
#6 NA 0 0
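The base R answers above all follow the same pattern: sum with na.rm = TRUE, then put NA back where every input was missing. A minimal sketch of that pattern as a reusable helper follows; the function name row_sum_keep_na and the column name ax_sum are hypothetical, chosen only for this illustration:

```r
# Hypothetical helper: row sums that ignore NA, but stay NA when a whole row is NA
row_sum_keep_na <- function(df) {
  all_na <- rowSums(!is.na(df)) == 0   # TRUE where every selected column is NA
  out <- rowSums(df, na.rm = TRUE)     # NA treated as 0 here
  out[all_na] <- NA                    # restore NA for all-missing rows
  out
}

data <- data.frame(a = c(0, 1, NA, 0, NA, NA),
                   x = c(NA, NA, NA, NA, 1, 0),
                   t = c(NA, 2, NA, NA, 2, 0))

data$ax_sum <- row_sum_keep_na(data[c("a", "x")])
data
```

Because the helper takes a data frame, it should work for any number of selected columns, not just two.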
I have a list containing a large set of data frames (each data frame represents a year). Each data frame consists of two variables: (1) an ID variable, and (2) a Bernoulli variable indicating whether an event happened that year. I want to unlist the list and instead create one data frame where each year's outcome variable is its own column.
Below, you can find code to generate example data that share the same feature as my data.
set.seed(123)
df <- list()
year <- (1950:1960)
# Simulating data.
for (i in 1:length(year)) {
id <- sample(1:20, 10)
outcome <- sample(0:1, 10, replace = T)
df[[i]] <- cbind(id, outcome)
}
So in practical terms, I want to unlist df and create a data frame that looks something like this:
id outcome1950 outcome1951 outcome1952 outcome1953 outcome1954 outcome1955 [...]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
[7,] 7
[8,] 8
[9,] 9
[10,] 10
[11,] 11
[12,] 12
[13,] 13
[14,] 14
[15,] 15
[16,] 16
[17,] 17
[18,] 18
[19,] 19
[20,] 20
Of course, the outcome should have each value for each year.
NOTE: Not every ID exists in every year; in those cases, I want to add an NA.
library(tidyverse)
names(df) <- paste0('outcome', year)
df %>%
purrr::map_df(as.data.frame, .id = 'name') %>%
tidyr::pivot_wider(names_from = name, values_from = outcome) %>%
dplyr::arrange(id)
# A tibble: 20 x 12
id outcome1950 outcome1951 outcome1952 outcome1953 outcome1954 outcome1955 outcome1956 outcome1957 outcome1958 outcome1959 outcome1960
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 NA 0 NA 0 NA 0 NA NA 0 NA NA
2 2 1 NA 0 1 NA 1 NA 1 NA 0 NA
3 3 1 0 NA NA 1 NA NA 0 0 NA NA
4 4 0 0 NA NA NA 0 NA 1 0 1 NA
5 5 0 NA 0 NA NA NA 1 NA NA NA NA
6 6 0 NA 0 1 NA 1 1 NA 0 1 NA
7 7 NA 1 0 0 1 NA 1 0 NA NA 1
8 8 NA 0 0 0 1 NA NA 1 1 0 0
9 9 NA 0 0 NA 1 NA NA 1 NA 0 0
10 10 0 0 0 NA NA 0 1 NA NA 1 1
11 11 0 NA NA 1 NA NA 1 NA 1 0 NA
12 12 NA NA NA NA 0 1 NA NA 1 0 0
13 13 NA NA 0 NA 1 NA NA 1 0 1 0
14 14 0 1 NA NA 0 NA 0 NA NA 0 0
15 15 1 NA 1 0 NA 0 1 NA 1 NA NA
16 16 NA NA NA 0 0 0 NA NA 0 NA 0
17 17 NA NA NA NA NA NA NA 1 NA NA NA
18 18 NA NA NA NA NA NA 1 NA NA NA 1
19 19 1 0 NA 0 0 1 0 1 NA NA NA
20 20 NA 1 1 1 0 0 0 1 NA NA 0
This should do it.
set.seed(123)
df <- list()
year <- (1950:1960)
# Simulating data.
for (i in 1:length(year)) {
id <- sample(1:20, 10)
outcome <- sample(0:1, 10, replace = T)
df[[i]] <- cbind(id, outcome)
}
## add names to df
names(df) <- year
## turn each element in df into a tibble (or data frame) and make a
## variable called "year" that holds "outcomeYYYY"
for(i in 1:length(df)){
df[[i]] <- as_tibble(df[[i]]) %>% mutate(year = paste0("outcome", names(df)[i]))
}
## turn the elements of the list into a single data tibble
df <- bind_rows(df)
## pivot to wider and arrange by id
dfwide <- df %>%
pivot_wider(names_from="year", values_from="outcome") %>%
arrange(id)
head(dfwide)
# A tibble: 6 x 12
# id outcome1950 outcome1951 outcome1952 outcome1953 outcome1954
# <int> <int> <int> <int> <int> <int>
# 1 1 NA 0 NA 0 NA
# 2 2 1 NA 0 1 NA
# 3 3 1 0 NA NA 1
# 4 4 0 0 NA NA NA
# 5 5 0 NA 0 NA NA
# 6 6 0 NA 0 1 NA
# … with 6 more variables: outcome1955 <int>, outcome1956 <int>,
# outcome1957 <int>, outcome1958 <int>, outcome1959 <int>,
# outcome1960 <int>
If I understood your problem correctly, a possible solution would be this:
library(dplyr)
library(tidyr)
library(purrr)
library(plyr)
purrr::map2(df, year, ~ cbind(.y, .x)) %>%
plyr::ldply(data.frame) %>%
dplyr::rename(YEAR = 1) %>%
tidyr::pivot_wider(names_from = "YEAR", names_prefix = "outcome", values_from = "outcome") %>%
dplyr::arrange(id)
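For comparison, the same reshaping can be sketched without any packages by renaming each year's outcome column and then full-joining the pieces on id. This is only an illustrative alternative, not one of the answers above; the names per_year and wide are made up for the example, and the data are simulated in the same spirit as the question:

```r
set.seed(123)
year <- 1950:1960

# Simulated list of two-column matrices (id, outcome), one per year
df <- lapply(year, function(y) {
  cbind(id = sample(1:20, 10), outcome = sample(0:1, 10, replace = TRUE))
})

# Rename "outcome" to "outcomeYYYY" in each piece
per_year <- Map(function(m, y) {
  d <- as.data.frame(m)
  names(d)[names(d) == "outcome"] <- paste0("outcome", y)
  d
}, df, year)

# Full outer join on id: ids missing in a year get NA for that year
wide <- Reduce(function(a, b) merge(a, b, by = "id", all = TRUE), per_year)
wide <- wide[order(wide$id), ]
head(wide)
```

Repeated merge() calls can become slow for a very long list, so the pivot-based answers are probably the better choice for large data.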
I'm hoping that someone can help me :)
I have a data frame with about 1000 columns.
Within that, I have columns named like this:
X1,X2,X3,X4,X5,X6 etc... Y1,Y2,Y3,Y4,Y5,Y6 etc...
df <- data.frame("X1" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"X2" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"X3" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"),
"X4" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"X5" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"X6" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"),
"Y1" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"Y2" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"Y3" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"),
"Y4" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"Y5" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"Y6" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"))
In certain columns, I want to replace "Yes" with 1 and "No" with 0, and replace anything else with an NA.
I have tried this:
names = c("X","Y")
for (name in names){
try(
for (j in 1:6){
j <- toString(j)
colname <- paste(name , j, sep="")
df$colname <- gsub("Yes", as.integer(1), df$colname)
df$colname <- gsub("No", as.integer(0), df$colname)
})}
However, this is not working, throwing error message:
Error in `$<-.data.frame`(`*tmp*`, "colname", value = character(0)) : replacement has 0 rows, data has 13
My first question is: Why are the column names not referencing properly?
Second question is: How do I replace anything that's not a 0 or 1 in those columns with an "NA"?
This is possibly a really simple thing that I'm overlooking, but I can't quite figure out how to do it.
Any help would be greatly appreciated.
Many thanks in advance,
Rich
I wouldn't use a loop or gsub here, you can use this:
df[] <- lapply(df, function(x) x <- car::recode(x, "'Yes'=1; 'No'=0; 'NA'=NA"))
This iterates over each column in your dataframe and recodes the values as you want. This is also easier to expand if you get more values in the future.
If you only want certain columns, you can modify it like this:
df[, col_list] <- lapply(df[, col_list], function(x) x <- car::recode(x, "'Yes'=1; 'No'=0; 'NA'=NA"))
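Where col_list is the vector of the variables you want to change. You could grep for them using col_list <- grep('^[XY]', names(df), value = TRUE). Note that the pattern '^X|Y' would match any name that starts with X or contains a Y anywhere.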
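On the first question (why the column names are not referencing properly): df$colname looks for a column literally named "colname", not the column whose name is stored in the variable colname. That lookup returns NULL, gsub() on NULL returns character(0), and assigning a zero-length value to a data frame column raises the "replacement has 0 rows" error. Using df[[colname]] evaluates the variable instead. A minimal sketch of the loop with that change, on a small made-up data frame (the toy values, including "maybe", are assumptions for illustration):

```r
# Toy data (assumption): one X column and one Y column
df <- data.frame(X1 = c("Yes", "No", "NA"),
                 Y1 = c("No", "maybe", "Yes"))

for (name in c("X", "Y")) {
  for (j in 1) {
    colname <- paste0(name, j)           # e.g. "X1"
    old <- as.character(df[[colname]])   # [[ ]] uses the value of colname
    new <- rep(NA_integer_, length(old)) # anything unmatched stays NA
    new[old %in% "Yes"] <- 1L
    new[old %in% "No"]  <- 0L
    df[[colname]] <- new
  }
}
df
```

Starting from an all-NA vector also covers the second question, since every value other than "Yes" or "No" ends up as NA.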
Since your data has only 'Yes', 'No' and 'NA' values you can also directly replace them.
#Column numbers to replace
cols <- grep('^[XY]\\d+', names(df))
#Replace "NA" with real NA
df[cols][df[cols] == 'NA'] <- NA
#Replace "Yes" with 1
df[cols][df[cols] == 'Yes'] <- 1
#Replace "No" with 0
df[cols][df[cols] == 'No'] <- 0
#Change dataframe type.
df <- type.convert(df)
df
# X1 X2 X3 X4 X5 X6 Y1 Y2 Y3 Y4 Y5 Y6
#1 1 1 1 1 1 1 1 1 1 1 1 1
#2 0 NA NA 0 NA NA 0 NA NA 0 NA NA
#3 1 NA NA 1 NA NA 1 NA NA 1 NA NA
#4 NA NA NA NA NA NA NA NA NA NA NA NA
#5 NA NA 1 NA NA 1 NA NA 1 NA NA 1
#6 NA 1 0 NA 1 0 NA 1 0 NA 1 0
#7 1 NA 1 1 NA 1 1 NA 1 1 NA 1
#8 0 NA NA 0 NA NA 0 NA NA 0 NA NA
#9 1 NA 1 1 NA 1 1 NA 1 1 NA 1
#10 NA NA NA NA NA NA NA NA NA NA NA NA
#11 NA 1 NA NA 1 NA NA 1 NA NA 1 NA
#12 NA NA NA NA NA NA NA NA NA NA NA NA
#13 NA NA 1 NA NA 1 NA NA 1 NA NA 1
If you are using R < 4.0.0, data.frame() creates factor columns by default (stringsAsFactors = TRUE), so you first need to convert the data to character:
df[] <- lapply(df, as.character)
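Another way to express the same recoding is a named lookup vector: indexing it by the column values returns 1 or 0 for known labels and NA for everything else. The sketch below is an assumption-laden illustration, not part of the answers above; the toy data frame, the extra id column, and the name lookup are all made up:

```r
# Toy data (assumption): two columns to recode and one to leave alone
df <- data.frame(X1 = c("Yes", "No", "NA", "Yes"),
                 X2 = c("NA", "Yes", "No", "other"),
                 id = c("a", "b", "c", "d"))

lookup <- c(Yes = 1L, No = 0L)            # labels not listed here become NA
cols <- grep("^[XY]\\d+$", names(df))     # only X<number> / Y<number> columns

df[cols] <- lapply(df[cols], function(x) unname(lookup[as.character(x)]))
df
```

Adding another label later only requires a new entry in lookup.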
I have a matrix of pairwise comparisons of all plots in my dataset. Matrix fill represents shared species among plots.
Plot4 Plot5 Plot6 Plot7 Plot8 Plot9 Plot10
Plot4 NA NA NA NA NA NA NA
Plot5 0 NA NA NA NA NA NA
Plot6 1 0 NA NA NA NA NA
Plot7 0 0 0 NA NA NA NA
Plot8 0 1 0 0 NA NA NA
Plot9 0 1 0 0 2 NA NA
Plot10 0 0 0 0 1 1 NA
This matrix came from the following dataframe:
data<-
region plot species
1 104 A_B
1 105 B_C
1 106 A_B
1 107 C_D
2 108 B_C
2 108 E_F
2 109 B_C
2 109 E_F
2 110 E_F
These plots are associated with certain regions. I generated the following loop that creates this pairwise comparison matrix for all 500 plots:
plots<-unique(data$plot)
plot.num<-length(plots)
output<-matrix(0, plot.num, plot.num)
for (i in 1:plot.num) {
for (j in 1:plot.num) {
plot_i<-data[data$plot==plots[i],]
plot_j<-data[data$plot==plots[j],]
output[i,j]<-length(intersect(plot_i$species, plot_j$species))
}
}
F.mat<-output
F.mat[lower.tri(F.mat, diag=T)]<-0
However, now I want to create a loop that subsets the larger matrix above by region to make a list of regional matrices.
output<-
[[1]]
Plot4 Plot5 Plot6 Plot7
Plot4 NA NA NA NA
Plot5 0 NA NA NA
Plot6 1 0 NA NA
Plot7 0 0 0 NA
[[2]]
Plot8 Plot9 Plot10
Plot8 NA NA NA
Plot9 2 NA NA
Plot10 1 1 NA
NOTE: This is a quantitative matrix not presence/absence.
You could put your evaluation into a function and then lapply over the regions:
countFun <- function(relData){
plots <- unique(relData$plot)
plot.num <- length(plots)
output <- matrix(NA, plot.num, plot.num)
if (plot.num > 1){
for (i in 2:plot.num) {
for (j in 1:(i-1)) {
plot_i <- relData[relData$plot==plots[i],]
plot_j <- relData[relData$plot==plots[j],]
output[i,j] <- length(intersect(plot_i$species, plot_j$species))
}
}
}
output
}
lapply(unique(data$region), function(region) countFun(data[data$region == region,]))
# [[1]]
# [,1] [,2] [,3] [,4]
# [1,] NA NA NA NA
# [2,] 0 NA NA NA
# [3,] 1 0 NA NA
# [4,] 0 0 0 NA
#
# [[2]]
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] 2 NA NA
# [3,] 1 1 NA
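With 500 plots the nested loop may become slow. One possible alternative, sketched here under the assumption that the data look like the example in the question, is to build a species-by-plot incidence matrix per region and take its cross-product, which counts shared species for every pair of plots at once. The helper name shared_by_region is hypothetical:

```r
# Example data as shown in the question
data <- data.frame(
  region  = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
  plot    = c(104, 105, 106, 107, 108, 108, 109, 109, 110),
  species = c("A_B", "B_C", "A_B", "C_D", "B_C", "E_F", "B_C", "E_F", "E_F")
)

# Hypothetical helper: shared-species counts for all plot pairs in one region
shared_by_region <- function(d) {
  u <- unique(d[c("plot", "species")])      # one row per plot/species pair
  inc <- unclass(table(u$species, u$plot))  # 0/1 incidence, species x plot
  m <- crossprod(inc)                       # plot x plot shared counts
  m[upper.tri(m, diag = TRUE)] <- NA        # keep lower triangle only
  m
}

res <- lapply(split(data, data$region), shared_by_region)
res
```

Unlike the loop version, the result keeps the plot numbers as row and column names, and the list is named by region.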
I have two overlapping matrices with some shared columns and rows:
m.1 = matrix(c(NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,1,1,1,NA,NA,NA,1,NA,NA), ncol=5)
colnames(m.1) <- c("-2","-1","0","1","2")
rownames(m.1) <- c("-2","-1","0","1","2")
## -2 -1 0 1 2
## -2 NA NA 1 NA NA
## -1 NA 1 1 1 NA
## 0 1 1 1 1 1
## 1 NA 1 1 1 NA
## 2 NA NA 1 NA NA
m.2 = matrix(c(NA,2,NA,2,2,2,NA,2,NA), ncol=3)
colnames(m.2) <- c("-1","0","1")
rownames(m.2) <- c("-1","0","1")
## -1 0 1
## -1 NA 2 NA
## 0 2 2 2
## 1 NA 2 NA
Now I want to pass the maximum value in each cell from m.1 and m.2 (matched by row and column name) to a new matrix m.max, which should look like this:
## -2 -1 0 1 2
## -2 NA NA 1 NA NA
## -1 NA 1 2 1 NA
## 0 1 2 2 2 1
## 1 NA 1 2 1 NA
## 2 NA NA 1 NA NA
Based on previous threads, I have meddled with merge(), replace() and match() but cannot get the desired result at all, e.g.
m.max<- merge(m.1,m.2, by = "row.names", all=TRUE, sort = TRUE)
## Row.names -2 -1.x 0.x 1.x 2 -1.y 0.y 1.y
## 1 -1 NA 1 1 1 NA NA 2 NA
## 2 -2 NA NA 1 NA NA NA NA NA
## 3 0 1 1 1 1 1 2 2 2
## 4 1 NA 1 1 1 NA NA 2 NA
## 5 2 NA NA 1 NA NA NA NA NA
Please help! Am I completely on the wrong track? Does this operation require a different kind of object than matrix? For example, I also tried to convert the matrices into raster objects and do cell statistics, but ran into problems because of the unequal dimensions of m.1 and m.2.
Importantly, the answer should also work for much larger objects, and regardless of whether I want to calculate the maximum, minimum or sum.
You can use pmax:
#we create a new matrix as big as m.1 with the values of m.2 in it
mres<-array(NA,dim(m.1),dimnames(m.1))
mres[rownames(m.2),colnames(m.2)]<-m.2
#Then we use pmax
pmax(m.1,mres,na.rm=TRUE)
# -2 -1 0 1 2
#-2 NA NA 1 NA NA
#-1 NA 1 2 1 NA
#0 1 2 2 2 1
#1 NA 1 2 1 NA
#2 NA NA 1 NA NA
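The question also asks about the minimum and the sum. The same padding idea carries over: pmin() handles the minimum directly, while for the sum there is no "psum" in base R, so the sketch below replaces NA with 0 before adding and then restores NA where both inputs were missing. The helper name pad_to and the result names m.min and m.sum are made up for this illustration:

```r
m.1 <- matrix(c(NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,1,1,1,NA,NA,NA,1,NA,NA),
              ncol = 5,
              dimnames = list(c("-2","-1","0","1","2"), c("-2","-1","0","1","2")))
m.2 <- matrix(c(NA,2,NA,2,2,2,NA,2,NA), ncol = 3,
              dimnames = list(c("-1","0","1"), c("-1","0","1")))

# Hypothetical helper: embed the smaller matrix in an NA matrix shaped like the bigger one
pad_to <- function(small, big) {
  out <- array(NA_real_, dim(big), dimnames(big))
  out[rownames(small), colnames(small)] <- small
  out
}
mres <- pad_to(m.2, m.1)

# Element-wise minimum, ignoring NA
m.min <- pmin(m.1, mres, na.rm = TRUE)

# Element-wise sum, ignoring NA but keeping NA where both cells are NA
m.sum <- ifelse(is.na(m.1), 0, m.1) + ifelse(is.na(mres), 0, mres)
m.sum[is.na(m.1) & is.na(mres)] <- NA

m.min
m.sum
```

This assumes every row and column name of the smaller matrix also exists in the larger one; otherwise the padding step would fail with a subscript error.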
I would like to randomly delete up to three elements per row of a data set containing five columns. Below is R code I thought would do it, but it allows up to all five elements in a row to be deleted. This seems basic, but I cannot find the error. Thank you for any advice.
set.seed(1234)
# create matrix to contain flags identifying elements to be deleted
delete.these <- matrix(0, nrow=10, ncol=5)
for(i in 1:nrow(delete.these)) {
# for each row randomly select the order of the columns
# to be tested for deletion
rcols <- sample(5, 5, replace = FALSE)
for(j in 1:ncol(delete.these)) {
# select a random draw
delete.it <- runif(1,0,1)
# if random draw is below specified threshold and fewer than three
# elements have already been deleted from the row then delete element
if((delete.it <= 0.7) & sum(delete.these[i,1:5] <= 2)) { delete.these[i,rcols[j]] = 1}
if((delete.it > 0.7) | sum(delete.these[i,1:5] >= 3)) { delete.these[i,rcols[j]] = 0}
}
}
delete.these
Instead of using runif(), try drawing the indices directly:
delete.these <- matrix(0, nrow=10, ncol=5)
for (i in 1:NROW(delete.these)){
delete.these[i,sample.int(5,sample.int(4,1)-1)] <- 1
}
delete.these
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 0 0
[2,] 0 0 0 0 0
[3,] 0 1 0 1 1
[4,] 0 1 1 0 1
[5,] 1 0 1 0 0
[6,] 0 0 0 0 0
[7,] 1 0 1 0 0
[8,] 0 1 0 1 1
[9,] 0 1 1 0 0
[10,] 1 0 1 0 1
By the way, your code doesn't work because of a misplaced parenthesis.
sum(delete.these[i,1:5] <= 2)
should be instead
sum(delete.these[i,1:5]) <= 2
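If you prefer to keep the original runif() logic, note that moving the parenthesis alone may not be enough: as far as I can tell, once the third flag in a row is set, the second if (the one testing sum >= 3) fires in the same iteration and resets that cell to 0. A minimal sketch that checks the cap before setting the flag, and drops the second if because the matrix already starts at 0, could look like this (max_del, p_del and under_cap are names introduced here for illustration):

```r
set.seed(1234)
n_row <- 10; n_col <- 5
max_del <- 3    # cap on deletions per row (assumed from the question)
p_del   <- 0.7  # deletion threshold used in the question

delete.these <- matrix(0, nrow = n_row, ncol = n_col)
for (i in seq_len(n_row)) {
  rcols <- sample(n_col)                            # random column order
  for (j in seq_len(n_col)) {
    under_cap <- sum(delete.these[i, ]) < max_del   # fewer than three so far?
    if (runif(1) <= p_del && under_cap) {
      delete.these[i, rcols[j]] <- 1
    }
  }
}
delete.these
rowSums(delete.these)
```

With a threshold of 0.7 most rows will probably reach the cap, so the number of deletions per row is not uniform over 0 to 3; drawing the count directly, as above, gives more control over that distribution.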
It would be easier (and much faster) to delete with a two column-matrix as an argument to [<-. You did not propose a test case but I will:
dfrm <- data.frame(a1=rnorm(20), a2=rnorm(20),a3=rnorm(20),
a4=rnorm(20),a5=rnorm(20))
dfrm[ matrix( c( rep(1:20,each=3),
replicate(20, {sample(5, 3)} ) ), ncol=2) ] <- NA
> dfrm
a1 a2 a3 a4 a5
1 NA 0.70871541 NA NA -0.6922827
2 1.9846227 1.70592512 NA NA NA
3 0.2684487 NA 0.0008968694 NA NA
4 NA NA 0.5546355410 0.07399188 NA
5 NA 0.82324761 -0.0410918599 NA NA
6 NA NA -1.0715205164 NA -0.1683819
7 0.0933059 NA NA NA 1.3129301
8 NA 0.79382695 0.1877369725 NA NA
9 0.3124101 NA NA -1.22087347 NA
10 -0.1657043 NA NA 1.36626832 NA
11 NA -0.06095247 -0.9622792102 NA NA
12 NA -1.29243386 -1.2133819819 NA NA
13 -0.0886702 NA NA 0.37495775 NA
14 1.0812527 -1.54215156 NA NA NA
15 NA -0.24765627 NA 0.81374405 NA
16 NA 0.21307051 NA NA -0.6825013
17 -0.4129100 NA NA NA -0.9844177
18 NA 1.95881167 0.7977172969 NA NA
19 NA NA 0.0953287645 NA 1.7067591
20 NA NA -0.1057690912 0.73408897 NA
This is assuming that by "delete" you meant set to missing. If the intent were something else you will need to supply a test case and clarify.
This nested sampling strategy will provide a variable number of rows (one to three) in the indexing matrix per row of the target matrix. lapply is used so that the result is always a list, even if every draw happens to have the same length:
idx <- lapply(1:20, function(x) {n <- sample(1:5, sample(1:3,1))
matrix( c(rep(x,length(n)), n), ncol=2) }) # list of 2-column matrices
idx <- do.call(rbind, idx) # now a 2 col matrix
dfrm[ idx] <- NA
> dfrm
a1 a2 a3 a4 a5
1 -0.048776740 NA 1.1879195 -0.23142932 -3.6185891
2 NA 0.4613289 -0.4532400 -0.85891682 -2.2034714
3 NA NA 1.1191833 1.12545821 NA
4 0.646399767 -0.7126735 2.9474470 0.36358070 NA
5 -0.630929314 1.3770828 NA NA 1.3987857
6 NA NA NA 1.06680025 0.4445383
7 0.484728630 NA 0.7382064 NA 0.9838159
8 -1.558031074 1.1630888 NA NA NA
9 -0.968887379 -0.7330051 NA 0.04621124 -0.9785049
10 0.935436533 NA NA -1.07365274 NA
11 NA 0.2529093 NA -1.38643245 -1.3389529
12 NA -0.2639166 -0.2301257 NA NA
13 2.026646586 -0.2452684 NA -0.30346521 NA
14 0.522717033 NA NA 1.25870278 NA
15 NA NA -0.9934046 -0.89009964 -0.8403772
16 NA NA 0.0987765 -0.98608109 1.4646301
17 NA 0.7693064 -0.9326388 -0.16240266 NA
18 -0.005393965 NA NA NA -0.8111057
19 NA 1.6241122 -1.1376916 0.15812435 NA
20 NA NA NA 0.71059666 0.5170046
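A quick way to check that a deletion scheme respects the "up to three per row" rule is to count the NA values per row afterwards. The sketch below is self-contained and makes a few assumptions of its own: a fresh random data frame, a seed of 42, and a draw of 0 to 3 deletions per row (so some rows may stay complete):

```r
set.seed(42)
dfrm <- as.data.frame(matrix(rnorm(100), nrow = 20,
                             dimnames = list(NULL, paste0("a", 1:5))))

# One 2-column (row, column) index block per row, with 0 to 3 columns each
idx <- do.call(rbind, lapply(seq_len(nrow(dfrm)), function(r) {
  cols <- sample(ncol(dfrm), sample(0:3, 1))
  cbind(rep(r, length(cols)), cols)
}))

dfrm[idx] <- NA

# Verify: number of missing values in each row
na_per_row <- rowSums(is.na(dfrm))
table(na_per_row)
```

The table of counts should contain no value above 3, and the total number of NA values should equal the number of rows in idx.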