Combining gsub() and using variable names as columns in R [duplicate]

This question already has answers here:
replace Yes, No to 1, 0 in multiple columns in r [duplicate]
(4 answers)
Closed 2 years ago.
I'm hoping that someone can help me :)
I have a data frame with about 1000 columns.
Within that, I have columns named like this:
X1,X2,X3,X4,X5,X6 etc... Y1,Y2,Y3,Y4,Y5,Y6 etc...
df <- data.frame("X1" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"X2" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"X3" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"),
"X4" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"X5" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"X6" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"),
"Y1" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"Y2" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"Y3" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"),
"Y4" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"Y5" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"Y6" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"))
In certain columns, I want to replace "Yes" with 1 and "No" with 0, and replace anything else with NA.
I have tried this:
names = c("X","Y")
for (name in names){
try(
for (j in 1:6){
j <- toString(j)
colname <- paste(name , j, sep="")
df$colname <- gsub("Yes", as.integer(1), df$colname)
df$colname <- gsub("No", as.integer(0), df$colname)
})}
However, this is not working, throwing error message:
Error in `$<-.data.frame`(`*tmp*`, "colname", value = character(0)) : replacement has 0 rows, data has 13
My first question is: Why are the column names not referencing properly?
Second question is: How do I replace anything that's not a 0 or 1 in those columns with an "NA"?
This is possibly a really simple thing that I'm overlooking, but I can't quite figure out how to do it.
Any help would be greatly appreciated.
Many thanks in advance,
Rich

I wouldn't use a loop or gsub here, you can use this:
df[] <- lapply(df, function(x) x <- car::recode(x, "'Yes'=1; 'No'=0; 'NA'=NA"))
This iterates over each column in your dataframe and recodes the values as you want. This is also easier to expand if you get more values in the future.
If you only want certain columns, you can modify it like this:
df[, col_list] <- lapply(df[, col_list], function(x) x <- car::recode(x, "'Yes'=1; 'No'=0; 'NA'=NA"))
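Where col_list is the vector of the variables you want to change. You could grep for them using col_list <- grep('^[XY]', names(df), value = TRUE). (A pattern like '^X|Y' would also match any name that merely contains a Y, because the ^ anchor applies only to the X alternative.)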
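On the first question (why the column names are not referenced properly): df$colname looks for a column literally called "colname" and does not evaluate the variable. The sketch below keeps the loop from the question but uses df[[colname]] instead. The small data frame and the lookup vector are illustrative assumptions, not part of the original post; any value that is not "Yes" or "No" falls through to NA because it has no match in the lookup.

```r
# Toy data (an assumption for illustration; the real data has ~1000 columns)
df <- data.frame(X1 = c("Yes", "No", "NA", "maybe"),
                 X2 = c("No", "Yes", "Yes", "NA"),
                 Y1 = c("Yes", "NA", "No", "No"),
                 stringsAsFactors = FALSE)

# Named lookup vector: names are the old values, elements are the new ones
lookup <- c(Yes = 1L, No = 0L)

for (name in c("X", "Y")) {
  for (j in 1:2) {
    colname <- paste0(name, j)
    if (!colname %in% names(df)) next  # e.g. Y2 does not exist in the toy data
    # [[ ]] evaluates colname; unmatched values (including "NA") become NA
    df[[colname]] <- unname(lookup[df[[colname]]])
  }
}
df
```

Indexing a named vector by a character vector returns NA for names it does not contain, which is what handles the "anything else" case here without a separate step.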

Since your data has only 'Yes', 'No' and 'NA' values you can also directly replace them.
#Column numbers to replace
cols <- grep('^[XY]\\d+', names(df))
#Replace "NA" with real NA
df[cols][df[cols] == 'NA'] <- NA
#Replace "Yes" with 1
df[cols][df[cols] == 'Yes'] <- 1
#Replace "No" with 0
df[cols][df[cols] == 'No'] <- 0
#Change dataframe type.
df <- type.convert(df)
df
# X1 X2 X3 X4 X5 X6 Y1 Y2 Y3 Y4 Y5 Y6
#1 1 1 1 1 1 1 1 1 1 1 1 1
#2 0 NA NA 0 NA NA 0 NA NA 0 NA NA
#3 1 NA NA 1 NA NA 1 NA NA 1 NA NA
#4 NA NA NA NA NA NA NA NA NA NA NA NA
#5 NA NA 1 NA NA 1 NA NA 1 NA NA 1
#6 NA 1 0 NA 1 0 NA 1 0 NA 1 0
#7 1 NA 1 1 NA 1 1 NA 1 1 NA 1
#8 0 NA NA 0 NA NA 0 NA NA 0 NA NA
#9 1 NA 1 1 NA 1 1 NA 1 1 NA 1
#10 NA NA NA NA NA NA NA NA NA NA NA NA
#11 NA 1 NA NA 1 NA NA 1 NA NA 1 NA
#12 NA NA NA NA NA NA NA NA NA NA NA NA
#13 NA NA 1 NA NA 1 NA NA 1 NA NA 1
If you are using R < 4.0.0, data.frame() creates factor columns by default, so you first need to convert the data to character:
df[] <- lapply(df, as.character)

Related

Applying a for-loop to different levels of a variable

I have created a data frame containing 3 sites, and I have written a nested for loop to create my desired matrices. The overall objective is to find a more efficient way to do this for each of the 3 sites instead of just one.
The outputs from the nested for loop (EDmatrix and timeLags) show the expected form of the results for the other two sites. I would like to find a more efficient way of obtaining these matrices, and to be able to do it for all sites instead of just the one in this example.
set.seed(123)
d1 = sample.int(50, 27)
d2 = sample.int(50, 27)
d3 = sample.int(50, 27)
year <- c(1990:1998)
site <- c(rep("a", 9), rep("b", 9), rep("c", 9))
ED = function(x,y){
#x and y are vectors of spp abundances
#they must be the same length!
if(length(x)!=length(y)) stop("Bad abundances!")
out = sqrt(sum((x-y)^2))
out
}
df <- data.frame(site, year, d1 = d1, d2 = d2, d3 = d3)
Here is the code to get the expected output for only a single site, but I would like to be able to do this for all of the sites in the data frame df.
subdf = subset(df,site=="a") # subset data for one site
EDmatrix = matrix(NA,dim(subdf)[1],dim(subdf)[1]) # create a place to store the dissimilarity values
timeLags = matrix(NA,dim(subdf)[1],dim(subdf)[1]) # create a place to store the time lags
# First loop through all "j" years from 1 to the total number of years
# Now loop through all "k" years from 1 to the total number of years
for(j in 1: length(subdf$year)){
for(k in 1: length(subdf$year)){
# grab density data for year "j"
jdensity <- subdf[j,-c(1:2)]
# grab density data for year "k"
kdensity <- subdf[k,-c(1:2)]
# calculate and store (in the EDmatrix) the ED value based on the data for year j and k
EDmatrix[j,k] <- ED(jdensity, kdensity)
# calculate and store (in timeLags) the time lag (the absolute value of the difference
# in time between year j and k
timeLags[j,k] <- abs(subdf[j, 2] - subdf[k, 2])
}# exit k loop
}# exit j loop
EDmatrix[lower.tri(EDmatrix, diag=T)]=NA # set duplicate entries to NA
timeLags[lower.tri(timeLags, diag=T)]=NA # set duplicate entries to NA
y = as.vector(EDmatrix) # turn the matrix into a vector
x = as.vector(timeLags)
We may use outer for this operation
library(dplyr)
library(tidyr)
library(purrr)
f1 <- function(dat, i, j) {
subdat <- dat %>%
select(starts_with('d'))
jdensity <- subdat[i, ]
kdensity <- subdat[j,]
EDtmp <- ED(jdensity, kdensity)
timetmp <- abs(dat$year[i] - dat$year[j])
tibble(EDtmp, timetmp)
}
f2 <- function(dat, s1, s2) {
mat <- outer(s1, s2, Vectorize(\(i, j) list(f1(dat, i, j))))
EDmatrix <- matrix(map_dbl(mat, ~ .x$EDtmp), length(s1), length(s1))
timeLags <- matrix(map_dbl(mat, ~ .x$timetmp), length(s1), length(s1))
EDmatrix[lower.tri(EDmatrix, diag=TRUE)]=NA
timeLags[lower.tri(timeLags, diag=TRUE)]=NA
y = as.vector(EDmatrix)
x = as.vector(timeLags)
tibble(y, x)
}
out1 <- df %>%
group_by(site) %>%
summarise(out = f2(cur_data(), row_number(), row_number()),
.groups = 'drop') %>%
unnest(out)
-checking with OP's output
> out1$x[out1$site == "a"]
[1] NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 2 1 NA NA NA NA NA NA NA 3 2 1 NA NA NA NA NA NA 4 3 2 1 NA NA NA NA NA 5 4 3
[49] 2 1 NA NA NA NA 6 5 4 3 2 1 NA NA NA 7 6 5 4 3 2 1 NA NA 8 7 6 5 4 3 2 1 NA
> x
[1] NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 2 1 NA NA NA NA NA NA NA 3 2 1 NA NA NA NA NA NA 4 3 2 1 NA NA NA NA NA 5 4 3
[49] 2 1 NA NA NA NA 6 5 4 3 2 1 NA NA NA 7 6 5 4 3 2 1 NA NA 8 7 6 5 4 3 2 1 NA
> out1$y[out1$site == "a"]
[1] NA NA NA NA NA NA NA NA NA 30.675723 NA NA NA NA
[15] NA NA NA NA 41.388404 18.055470 NA NA NA NA NA NA NA 42.485292
[29] 33.136083 25.729361 NA NA NA NA NA NA 38.288379 41.581246 34.770677 39.433488 NA NA
[43] NA NA NA 13.038405 38.379682 49.264592 54.083269 40.865633 NA NA NA NA 16.431677 25.317978
[57] 36.701499 47.549974 36.359318 15.362291 NA NA NA 34.799425 54.680892 54.018515 49.254441 26.019224 35.791060 41.484937
[71] NA NA 9.433981 34.842503 46.108568 42.801869 45.199558 19.924859 25.079872 38.652296 NA
> y
[1] NA NA NA NA NA NA NA NA NA 30.675723 NA NA NA NA
[15] NA NA NA NA 41.388404 18.055470 NA NA NA NA NA NA NA 42.485292
[29] 33.136083 25.729361 NA NA NA NA NA NA 38.288379 41.581246 34.770677 39.433488 NA NA
[43] NA NA NA 13.038405 38.379682 49.264592 54.083269 40.865633 NA NA NA NA 16.431677 25.317978
[57] 36.701499 47.549974 36.359318 15.362291 NA NA NA 34.799425 54.680892 54.018515 49.254441 26.019224 35.791060 41.484937
[71] NA NA 9.433981 34.842503 46.108568 42.801869 45.199558 19.924859 25.079872 38.652296 NA
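As a possible base R alternative to the approach above, and only as a sketch: stats::dist() computes Euclidean distances between rows by default, so it can stand in for the ED() function, and applying it to the year vector gives the absolute time lags. The helper name lag_pairs is made up for this example, and the data frame is rebuilt here so the block is self-contained.

```r
set.seed(123)
df <- data.frame(site = rep(c("a", "b", "c"), each = 9),
                 year = rep(1990:1998, times = 3),
                 d1 = sample.int(50, 27),
                 d2 = sample.int(50, 27),
                 d3 = sample.int(50, 27))

# Hypothetical helper: pairwise distances and time lags for one site
lag_pairs <- function(subdf) {
  # dist() uses the Euclidean metric by default
  EDmatrix <- as.matrix(dist(subdf[, c("d1", "d2", "d3")]))
  # in one dimension, Euclidean distance is the absolute difference
  timeLags <- as.matrix(dist(subdf$year))
  # blank out the diagonal and the duplicated lower triangle, as in the question
  EDmatrix[lower.tri(EDmatrix, diag = TRUE)] <- NA
  timeLags[lower.tri(timeLags, diag = TRUE)] <- NA
  data.frame(y = as.vector(EDmatrix), x = as.vector(timeLags))
}

# One result per site, returned as a named list
res <- lapply(split(df, df$site), lag_pairs)
str(res, max.level = 1)
```

res$a, res$b and res$c then each hold the x and y vectors for one site; do.call(rbind, res) would stack them if a single data frame is preferred.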

Create many new empty columns in an existing R data frame from a list of column names [duplicate]

I have following dataframe and vector:
ddf = data.frame(a=rep(1,10), b=rep(2,10))
xx = c("c", "d", "e", "f")
How can I add new empty columns named after the items in xx?
I tried the following, but it does not work:
ddf = cbind(ddf, data.frame(xx))
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 10, 4
The following also does not work:
for(i in 1:length(xx)){
ddf$(xx[i]) = ""
}
Error: unexpected '(' in:
"for(i in 1:length(xx)){
ddf$("
}
Error: unexpected '}' in "}"
This will get you there:
ddf[xx] <- NA
# a b c d e f
#1 1 2 NA NA NA NA
#2 1 2 NA NA NA NA
#3 1 2 NA NA NA NA
#...
You can't directly use something like ddf$xx because this will try to assign to a column called xx rather than interpreting xx. You need to use [ and [<- functions, using the square brackets when you are dealing with a character string/vector - like ddf["columnname"] or ddf[c("col1","col2")], or a stored vector like your ddf[xx].
This selects columns because data frames are essentially lists:
is.list(ddf)
#[1] TRUE
as.list(ddf)
#$a
# [1] 1 1 1 1 1 1 1 1 1 1
#
#$b
# [1] 2 2 2 2 2 2 2 2 2 2
...with each column corresponding to a list entry. So if you don't use a comma to specify a row, like ddf["name",] or a column like ddf[,"name"], you get the column by default.
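To make the difference between $ and the bracket operators concrete, here is a small sketch using the data from the question. The tmp copy and the loop variable nm are introduced only for illustration.

```r
ddf <- data.frame(a = rep(1, 10), b = rep(2, 10))
xx <- c("c", "d", "e", "f")

# `$` takes the name literally, so this adds a single column called "xx"
tmp <- ddf
tmp$xx <- NA
names(tmp)

# `[[` evaluates its argument, so the loop from the question can be written as
for (nm in xx) {
  ddf[[nm]] <- NA
}
names(ddf)
```

The loop gives the same columns as ddf[xx] <- NA; the one-line form is shorter, but the loop form is handy when each new column needs a different starting value.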
If you are working with a 0-row dataset, you cannot use a value like NA as the replacement. Instead, replace with list(character(0)), where character(0) can be swapped for numeric(0), integer(0), logical(0) etc., depending on the class you want for your new columns.
ddf <- data.frame(a=character())
xx <- c("c", "d", "e", "f")
ddf[xx] <- list(character(0))
ddf
#[1] a c d e f
#<0 rows> (or 0-length row.names)
This seems to succeed:
> cbind(ddf, setNames( lapply(xx, function(x) x=NA), xx) )
a b c d e f
1 1 2 NA NA NA NA
2 1 2 NA NA NA NA
3 1 2 NA NA NA NA
4 1 2 NA NA NA NA
5 1 2 NA NA NA NA
6 1 2 NA NA NA NA
7 1 2 NA NA NA NA
8 1 2 NA NA NA NA
9 1 2 NA NA NA NA
10 1 2 NA NA NA NA

Merge two columns maintaining missing values

I am trying to add two columns. My dataframe is like this one:
data <- data.frame(a = c(0,1,NA,0,NA,NA),
x = c(NA,NA,NA,NA,1,0),
t = c(NA,2,NA,NA,2,0))
I want to add some of the columns like this:
yep <- cbind.data.frame( data$a, data$x, rowSums(data[,c(1, 2)], na.rm = TRUE))
However the output looks like this:
> yep
data$a data$x rowSums(data[,c(1, 2)], na.rm = TRUE)
1 0 NA 0
2 1 NA 1
3 NA NA 0
4 0 NA 0
5 NA 1 1
6 NA 0 0
And I would like an output like this:
> yep
data$a data$x rowSums(data[,c(1, 2)], na.rm = TRUE)
1 0 NA 0
2 1 NA 1
3 NA NA NA
4 0 NA 0
5 NA 1 1
6 NA 0 0
If both columns are NA in a row, I want the result for that row to stay NA.
How could I achieve this?
Base R:
data <- data.frame("a" = c(0,1,NA,0,NA,NA),
"x" = c(NA,NA,NA,NA,1,0),
"t" = c(NA,2,NA,NA,2,0)
)
yep <- cbind.data.frame( data$a, data$x, rs = rowSums(data[,c(1, 2)], na.rm = TRUE))
yep$rs[is.na(data$a) & is.na(data$x)] <- NA
yep
Base R (ifelse):
cbind(data$a,data$x,ifelse(is.na(data$a) & is.na(data$x),NA,rowSums(data[,1:2],na.rm = TRUE)))
If you are looking for the column name then replace cbind with cbind.data.frame
Output:
[,1] [,2] [,3]
[1,] 0 NA 0
[2,] 1 NA 1
[3,] NA NA NA
[4,] 0 NA 0
[5,] NA 1 1
[6,] NA 0 0
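You might try dplyr::coalesce. Note that it returns the first non-missing value in each row rather than a sum, so it matches the desired output here only because no row has values in both columns.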
cbind.data.frame( data$a, data$x, dplyr::coalesce(data$a, data$x))
# data$a data$x dplyr::coalesce(data$a, data$x)
#1 0 NA 0
#2 1 NA 1
#3 NA NA NA
#4 0 NA 0
#5 NA 1 1
#6 NA 0 0
base r ifelse
data[['rowsum']]<-ifelse(is.na(data$a) & is.na(data$x),NA,ifelse(is.na(data$a),0,data$a)+ifelse(is.na(data$x),0,data$x))
a x t rowsum
1: 0 NA NA 0
2: 1 NA 2 1
3: NA NA NA NA
4: 0 NA NA 0
5: NA 1 2 1
6: NA 0 0 0
Another base R approach.
If all the values in the rows are NA then return NA or else return sum of the row ignoring NA's.
#Select only the columns which we need
sub_df <- data[c("a", "x")]
sub_df$answer <- ifelse(rowSums(is.na(sub_df)) == ncol(sub_df), NA,
rowSums(sub_df, na.rm = TRUE))
sub_df
# a x answer
#1 0 NA 0
#2 1 NA 1
#3 NA NA NA
#4 0 NA 0
#5 NA 1 1
#6 NA 0 0
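The same idea can be wrapped in a small function so it works for any set of columns. This is only a sketch; the name row_sum_keep_na is hypothetical and not from any package.

```r
# Hypothetical helper: row sums that ignore NAs, except that a row
# in which every selected value is NA yields NA instead of 0
row_sum_keep_na <- function(df, cols) {
  sub <- df[cols]
  total <- rowSums(sub, na.rm = TRUE)
  total[rowSums(!is.na(sub)) == 0] <- NA  # no non-missing value in the row
  total
}

data <- data.frame(a = c(0, 1, NA, 0, NA, NA),
                   x = c(NA, NA, NA, NA, 1, 0),
                   t = c(NA, 2, NA, NA, 2, 0))

row_sum_keep_na(data, c("a", "x"))       # values: 0 1 NA 0 1 0
row_sum_keep_na(data, c("a", "x", "t"))  # values: 0 3 NA 0 3 0
```

Unlike coalesce, this really adds the values, so rows with more than one non-missing entry (rows 2 and 5 once t is included) are summed.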

Subtracting Columns Except When There is NA

I am trying to create a new variable that subtracts two columns only when both columns do not have NA, but has NA whenever one of the columns has NA. When I try to just subtract the columns, I only get a column of NA. For instance, I am writing the command:
d$x3 <- d$x2 - d$x1
When I use the command above, I get:
x1 x2 x3
1 3 NA
1 NA NA
NA 3 NA
NA NA NA
Based on looking at some other posts online, I tried a workaround where I changed x1 to negative numbers and then used the rowSums command, but then I got this:
x3 <- rowSums(df[,c("x1","x2")], na.rm = TRUE)
x1 x2 x3
-1 3 2
-1 NA -1
NA 3 3
NA NA 0
What I am trying to produce is:
x1 x2 x3
1 3 2
1 NA NA
NA 3 NA
NA NA NA
Thanks for any help!
df <- read.table( text="x1 x2
1 3
1 NA
NA 3
NA NA", header=T)
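With numeric columns, plain subtraction already propagates NA in the way the question asks for, as the sketch below shows. Why the original data gave a whole column of NA cannot be told from the post; one possible cause, and this is an assumption, is that the columns are stored as factors, in which case arithmetic returns NA with a warning. The data frames d and f are built here only for illustration.

```r
# Numeric columns: NA in either operand gives NA in the result
d <- data.frame(x1 = c(1, 1, NA, NA), x2 = c(3, NA, 3, NA))
d$x3 <- d$x2 - d$x1
d$x3  # 2 NA NA NA

# Assumed failure mode: factor columns make every result NA
f <- data.frame(x1 = factor(c(1, 1, NA, NA)), x2 = factor(c(3, NA, 3, NA)))
all_na <- suppressWarnings(f$x2 - f$x1)
all_na  # NA NA NA NA

# Converting via character recovers the numbers before subtracting
f$x3 <- as.numeric(as.character(f$x2)) - as.numeric(as.character(f$x1))
f$x3  # 2 NA NA NA
```

Running str(d) on the real data would show whether the columns are numeric, factor or character.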

randomly delete up to 3 elements per row

I would like to randomly delete up to three elements per row of a data set containing five columns. Below is R code I thought would do it, but it allows up to all five elements in a row to be deleted. This seems basic, but I cannot find the error. Thank you for any advice.
set.seed(1234)
# create matrix to contain flags identifying elements to be deleted
delete.these <- matrix(0, nrow=10, ncol=5)
for(i in 1:nrow(delete.these)) {
# for each row randomly select the order of the columns
# to be tested for deletion
rcols <- sample(5, 5, replace = FALSE)
for(j in 1:ncol(delete.these)) {
# select a random draw
delete.it <- runif(1,0,1)
# if random draw is below specified threshold and fewer than three
# elements have already been deleted from the row then delete element
if((delete.it <= 0.7) & sum(delete.these[i,1:5] <= 2)) { delete.these[i,rcols[j]] = 1}
if((delete.it > 0.7) | sum(delete.these[i,1:5] >= 3)) { delete.these[i,rcols[j]] = 0}
}
}
delete.these
Instead of using runif() try drawing the indices directly
delete.these <- matrix(0, nrow=10, ncol=5)
for (i in 1:NROW(delete.these)){
delete.these[i,sample.int(5,sample.int(4,1)-1)] <- 1
}
delete.these
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 0 0
[2,] 0 0 0 0 0
[3,] 0 1 0 1 1
[4,] 0 1 1 0 1
[5,] 1 0 1 0 0
[6,] 0 0 0 0 0
[7,] 1 0 1 0 0
[8,] 0 1 0 1 1
[9,] 0 1 1 0 0
[10,] 1 0 1 0 1
By the way, your code doesn't work because of a misplaced parenthesis (the >= 3 test on the following line has the same problem).
sum(delete.these[i,1:5] <= 2)
should be instead
sum(delete.these[i,1:5]) <= 2
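For reference, here is a sketch of the loop from the question with the parenthesis moved. The second if statement is left out on the assumption that it is redundant: the matrix starts at 0, so a cell that is not flagged simply stays 0.

```r
set.seed(1234)
# matrix of flags identifying elements to be deleted
delete.these <- matrix(0, nrow = 10, ncol = 5)
for (i in seq_len(nrow(delete.these))) {
  # random order in which the columns of this row are tested
  rcols <- sample(5, 5, replace = FALSE)
  for (j in seq_len(ncol(delete.these))) {
    delete.it <- runif(1, 0, 1)
    # flag only if the draw is below the threshold and fewer than
    # three elements of this row have been flagged so far
    if (delete.it <= 0.7 && sum(delete.these[i, ]) <= 2) {
      delete.these[i, rcols[j]] <- 1
    }
  }
}
rowSums(delete.these)  # never more than 3 per row
```

With sum() now applied to the row before the comparison, the count of flagged elements is what gets tested, so no row can exceed three deletions.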
It would be easier (and much faster) to delete with a two column-matrix as an argument to [<-. You did not propose a test case but I will:
dfrm <- data.frame(a1=rnorm(20), a2=rnorm(20),a3=rnorm(20),
a4=rnorm(20),a5=rnorm(20))
dfrm[ matrix( c( rep(1:20,each=3),
replicate(20, {sample(5, 3)} ) ), ncol=2) ] <- NA
> dfrm
a1 a2 a3 a4 a5
1 NA 0.70871541 NA NA -0.6922827
2 1.9846227 1.70592512 NA NA NA
3 0.2684487 NA 0.0008968694 NA NA
4 NA NA 0.5546355410 0.07399188 NA
5 NA 0.82324761 -0.0410918599 NA NA
6 NA NA -1.0715205164 NA -0.1683819
7 0.0933059 NA NA NA 1.3129301
8 NA 0.79382695 0.1877369725 NA NA
9 0.3124101 NA NA -1.22087347 NA
10 -0.1657043 NA NA 1.36626832 NA
11 NA -0.06095247 -0.9622792102 NA NA
12 NA -1.29243386 -1.2133819819 NA NA
13 -0.0886702 NA NA 0.37495775 NA
14 1.0812527 -1.54215156 NA NA NA
15 NA -0.24765627 NA 0.81374405 NA
16 NA 0.21307051 NA NA -0.6825013
17 -0.4129100 NA NA NA -0.9844177
18 NA 1.95881167 0.7977172969 NA NA
19 NA NA 0.0953287645 NA 1.7067591
20 NA NA -0.1057690912 0.73408897 NA
This is assuming that by "delete" you meant set to missing. If the intent were something else you will need to supply a test case and clarify.
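The two-column indexing used above may be easier to see on a tiny example. In the sketch below, each row of idx is one (row, column) pair, and a single assignment sets all of those cells at once; the matrix m and the chosen positions are arbitrary.

```r
m <- matrix(1:12, nrow = 3)  # 3 x 4 matrix filled column by column

# each row of idx names one cell: (1,2), (3,1) and (3,4)
idx <- cbind(row = c(1, 3, 3), col = c(2, 1, 4))

m[idx] <- NA
m
```

Only the three listed cells become NA. The same form of indexing works on a data frame when the index matrix is numeric, which is what the dfrm example relies on.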
This nested sampling strategy will provide a variable number of rows in the indexing matrix per row of the target data frame:
# lapply always returns a list, even if every draw happens to have the same size
idx <- lapply(1:20, function(x) {n <- sample(1:5, sample(1:3,1))
matrix( c(rep(x,length(n)), n), ncol=2) }) # list of 2-column matrices
idx <- do.call(rbind, idx) # now a 2 col matrix
dfrm[ idx] <- NA
> dfrm
a1 a2 a3 a4 a5
1 -0.048776740 NA 1.1879195 -0.23142932 -3.6185891
2 NA 0.4613289 -0.4532400 -0.85891682 -2.2034714
3 NA NA 1.1191833 1.12545821 NA
4 0.646399767 -0.7126735 2.9474470 0.36358070 NA
5 -0.630929314 1.3770828 NA NA 1.3987857
6 NA NA NA 1.06680025 0.4445383
7 0.484728630 NA 0.7382064 NA 0.9838159
8 -1.558031074 1.1630888 NA NA NA
9 -0.968887379 -0.7330051 NA 0.04621124 -0.9785049
10 0.935436533 NA NA -1.07365274 NA
11 NA 0.2529093 NA -1.38643245 -1.3389529
12 NA -0.2639166 -0.2301257 NA NA
13 2.026646586 -0.2452684 NA -0.30346521 NA
14 0.522717033 NA NA 1.25870278 NA
15 NA NA -0.9934046 -0.89009964 -0.8403772
16 NA NA 0.0987765 -0.98608109 1.4646301
17 NA 0.7693064 -0.9326388 -0.16240266 NA
18 -0.005393965 NA NA NA -0.8111057
19 NA 1.6241122 -1.1376916 0.15812435 NA
20 NA NA NA 0.71059666 0.5170046
