Combining gsub() and using variable names as columns in R [duplicate] - r
This question already has answers here:
replace Yes, No to 1, 0 in multiple columns in r [duplicate]
(4 answers)
Closed 2 years ago.
I'm hoping that someone can help me :)
I have a data frame with about 1000 columns.
Within that, I have columns named like this:
X1,X2,X3,X4,X5,X6 etc... Y1,Y2,Y3,Y4,Y5,Y6 etc...
df <- data.frame("X1" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"X2" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"X3" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"),
"X4" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"X5" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"X6" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"),
"Y1" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"Y2" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"Y3" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"),
"Y4" = c("Yes","No","Yes","NA","NA","NA","Yes","No","Yes","NA","NA","NA","NA"),
"Y5" = c("Yes","NA","NA","NA","NA","Yes","NA","NA","NA","NA","Yes","NA","NA"),
"Y6" = c("Yes","NA","NA","NA","Yes","No","Yes","NA","Yes","NA","NA","NA", "Yes"))
In certain columns, I want to replace "Yes" with 1 and "No" with 0, and replace anything else with NA.
I have tried this:
names = c("X","Y")
for (name in names){
try(
for (j in 1:6){
j <- toString(j)
colname <- paste(name , j, sep="")
df$colname <- gsub("Yes", as.integer(1), df$colname)
df$colname <- gsub("No", as.integer(0), df$colname)
})}
However, this does not work; it throws the following error message:
Error in `$<-.data.frame`(`*tmp*`, "colname", value = character(0)) : replacement has 0 rows, data has 13
My first question is: Why are the column names not referencing properly?
Second question is: How do I replace anything that's not a 0 or 1 in those columns with an "NA"?
This is possibly a really simple thing that I'm overlooking, but I can't quite figure out how to do it.
Any help would be greatly appreciated.
Many thanks in advance,
Rich
I wouldn't use a loop or gsub here, you can use this:
df[] <- lapply(df, function(x) car::recode(x, "'Yes'=1; 'No'=0; 'NA'=NA"))
This iterates over each column in your dataframe and recodes the values as you want. This is also easier to expand if you get more values in the future.
If you only want certain columns, you can modify it like this:
df[, col_list] <- lapply(df[, col_list], function(x) car::recode(x, "'Yes'=1; 'No'=0; 'NA'=NA"))
Where col_list is the vector of the variables you want to change. You could grep for them using col_list <- grep('^(X|Y)', names(df), value = TRUE)
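As for why the loop in the question fails: df$colname looks for a column literally named "colname" and ignores the variable. That column does not exist, so gsub() receives NULL and returns character(0), which is the "replacement has 0 rows" error. Indexing with [[ ]] uses the string stored in the variable. The following is a minimal sketch of the loop rewritten that way; the small data frame is a made-up stand-in for the real data, and anything that is not "Yes" or "No" ends up as NA.

```r
# Made-up stand-in for the question's data (two X and two Y columns)
df <- data.frame(X1 = c("Yes", "No", "NA"),
                 X2 = c("No", "NA", "Yes"),
                 Y1 = c("No", "Yes", "maybe"),
                 Y2 = c("NA", "NA", "Yes"),
                 stringsAsFactors = FALSE)

for (name in c("X", "Y")) {
  for (j in 1:2) {
    colname <- paste0(name, j)
    # df[[colname]] resolves the string held in `colname`;
    # df$colname would look for a column called "colname".
    old <- df[[colname]]
    new <- rep(NA_integer_, length(old))  # anything unmatched stays NA
    new[old == "Yes"] <- 1L
    new[old == "No"]  <- 0L
    df[[colname]] <- new
  }
}
df
```

This keeps the structure of the original loop; the lapply() approach above does the same work without explicit iteration.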
Since your data has only 'Yes', 'No' and 'NA' values you can also directly replace them.
#Column numbers to replace
cols <- grep('^[XY]\\d+', names(df))
#Replace "NA" with real NA
df[cols][df[cols] == 'NA'] <- NA
#Replace "Yes" with 1
df[cols][df[cols] == 'Yes'] <- 1
#Replace "No" with 0
df[cols][df[cols] == 'No'] <- 0
#Convert each column to its natural type (integer here).
df <- type.convert(df, as.is = TRUE)
df
# X1 X2 X3 X4 X5 X6 Y1 Y2 Y3 Y4 Y5 Y6
#1 1 1 1 1 1 1 1 1 1 1 1 1
#2 0 NA NA 0 NA NA 0 NA NA 0 NA NA
#3 1 NA NA 1 NA NA 1 NA NA 1 NA NA
#4 NA NA NA NA NA NA NA NA NA NA NA NA
#5 NA NA 1 NA NA 1 NA NA 1 NA NA 1
#6 NA 1 0 NA 1 0 NA 1 0 NA 1 0
#7 1 NA 1 1 NA 1 1 NA 1 1 NA 1
#8 0 NA NA 0 NA NA 0 NA NA 0 NA NA
#9 1 NA 1 1 NA 1 1 NA 1 1 NA 1
#10 NA NA NA NA NA NA NA NA NA NA NA NA
#11 NA 1 NA NA 1 NA NA 1 NA NA 1 NA
#12 NA NA NA NA NA NA NA NA NA NA NA NA
#13 NA NA 1 NA NA 1 NA NA 1 NA NA 1
If you are using R < 4.0.0, where data.frame() creates factors by default, convert the columns to character first, before doing the replacements:
df[] <- lapply(df, as.character)
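A compact alternative for the second question (turning everything that is not "Yes" or "No" into NA) is a named lookup vector. This is only a sketch under assumptions: the data frame below is invented for illustration, and the names lookup and cols are hypothetical. Indexing a named vector by a name it does not contain returns NA, so unexpected values need no special handling.

```r
# Invented example data; `id` stands for a column that must stay untouched
df <- data.frame(X1 = c("Yes", "No", "NA", "Yes"),
                 Y1 = c("No", "NA", "Yes", "unsure"),
                 id = c("a", "b", "c", "d"),
                 stringsAsFactors = FALSE)

lookup <- c(Yes = 1L, No = 0L)            # values to keep, by name
cols <- grep("^[XY]\\d+$", names(df))     # positions of X1, Y1, ...

# as.character() also makes this work if the columns are factors
df[cols] <- lapply(df[cols], function(x) unname(lookup[as.character(x)]))
df
```

Adding another recoded value later only means adding another entry to lookup.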
Related
Applying a for-loop to different levels of a variable
I have created a data frame with 3 sites, and a nested for loop that builds my desired matrices for one of them. The overall objective is to find a more efficient way to do this for each of the 3 sites instead of just the one. The outputs from the nested for loop (EDmatrix and timeLags) are the results I expect to get for the other two sites as well.
set.seed(123)
d1 = sample.int(50, 27)
d2 = sample.int(50, 27)
d3 = sample.int(50, 27)
year <- c(1990:1998)
site <- c(rep("a", 9), rep("b", 9), rep("c", 9))
ED = function(x,y){
  #x and y are vectors of spp abundances
  #they must be the same length!
  if(length(x)!=length(y)) stop("Bad abundances!")
  out = sqrt(sum((x-y)^2))
  out
}
df <- data.frame(site, year, d1 = d1, d2 = d2, d3 = d3)
Here is the code to get the expected output for only a single site, but I would like to be able to do this for all of the sites in the data frame df.
subdf = subset(df,site=="a") # subset data for one site
EDmatrix = matrix(NA,dim(subdf)[1],dim(subdf)[1]) # create a place to store the dissimilarity values
timeLags = matrix(NA,dim(subdf)[1],dim(subdf)[1]) # create a place to store the time lags
# First loop through all "j" years from 1 to the total number of years
# Now loop through all "k" years from 1 to the total number of years
for(j in 1: length(subdf$year)){
  for(k in 1: length(subdf$year)){
    # grab density data for year "j"
    jdensity <- subdf[j,-c(1:2)]
    # grab density data for year "k"
    kdensity <- subdf[k,-c(1:2)]
    # calculate and store (in the EDmatrix) the ED value based on the data for year j and k
    EDmatrix[j,k] <- ED(jdensity, kdensity)
    # calculate and store (in timeLags) the time lag (the absolute value of the difference
    # in time between year j and k
    timeLags[j,k] <- abs(subdf[j, 2] - subdf[k, 2])
  }# exit k loop
}# exit j loop
EDmatrix[lower.tri(EDmatrix, diag=T)]=NA # set duplicate entries to NA
timeLags[lower.tri(timeLags, diag=T)]=NA # set duplicate entries to NA
y = as.vector(EDmatrix) # turn the matrix into a vector
x = as.vector(timeLags)
We may use outer for this operation:
library(dplyr)
library(tidyr)
library(purrr)
f1 <- function(dat, i, j) {
  subdat <- dat %>% select(starts_with('d'))
  jdensity <- subdat[i, ]
  kdensity <- subdat[j,]
  EDtmp <- ED(jdensity, kdensity)
  timetmp <- abs(dat$year[i] - dat$year[j])
  tibble(EDtmp, timetmp)
}
f2 <- function(dat, s1, s2) {
  mat <- outer(s1, s2, Vectorize(\(i, j) list(f1(dat, i, j))))
  EDmatrix <- matrix(map_dbl(mat, ~ .x$EDtmp), length(s1), length(s1))
  timeLags <- matrix(map_dbl(mat, ~ .x$timetmp), length(s1), length(s1))
  EDmatrix[lower.tri(EDmatrix, diag=TRUE)]=NA
  timeLags[lower.tri(timeLags, diag=TRUE)]=NA
  y = as.vector(EDmatrix)
  x = as.vector(timeLags)
  tibble(y, x)
}
out1 <- df %>%
  group_by(site) %>%
  summarise(out = f2(cur_data(), row_number(), row_number()), .groups = 'drop') %>%
  unnest(out)
Checking against the OP's output:
> out1$x[out1$site == "a"]
 [1] NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 2 1 NA NA NA NA NA NA NA 3 2 1 NA NA NA NA NA NA 4 3 2 1 NA NA NA NA NA 5 4 3
[49] 2 1 NA NA NA NA 6 5 4 3 2 1 NA NA NA 7 6 5 4 3 2 1 NA NA 8 7 6 5 4 3 2 1 NA
> x
 [1] NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 2 1 NA NA NA NA NA NA NA 3 2 1 NA NA NA NA NA NA 4 3 2 1 NA NA NA NA NA 5 4 3
[49] 2 1 NA NA NA NA 6 5 4 3 2 1 NA NA NA 7 6 5 4 3 2 1 NA NA 8 7 6 5 4 3 2 1 NA
> out1$y[out1$site == "a"]
 [1] NA NA NA NA NA NA NA NA NA 30.675723 NA NA NA NA
[15] NA NA NA NA 41.388404 18.055470 NA NA NA NA NA NA NA 42.485292
[29] 33.136083 25.729361 NA NA NA NA NA NA 38.288379 41.581246 34.770677 39.433488 NA NA
[43] NA NA NA 13.038405 38.379682 49.264592 54.083269 40.865633 NA NA NA NA 16.431677 25.317978
[57] 36.701499 47.549974 36.359318 15.362291 NA NA NA 34.799425 54.680892 54.018515 49.254441 26.019224 35.791060 41.484937
[71] NA NA 9.433981 34.842503 46.108568 42.801869 45.199558 19.924859 25.079872 38.652296 NA
> y
 [1] NA NA NA NA NA NA NA NA NA 30.675723 NA NA NA NA
[15] NA NA NA NA 41.388404 18.055470 NA NA NA NA NA NA NA 42.485292
[29] 33.136083 25.729361 NA NA NA NA NA NA 38.288379 41.581246 34.770677 39.433488 NA NA
[43] NA NA NA 13.038405 38.379682 49.264592 54.083269 40.865633 NA NA NA NA 16.431677 25.317978
[57] 36.701499 47.549974 36.359318 15.362291 NA NA NA 34.799425 54.680892 54.018515 49.254441 26.019224 35.791060 41.484937
[71] NA NA 9.433981 34.842503 46.108568 42.801869 45.199558 19.924859 25.079872 38.652296 NA
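Since ED() is the ordinary Euclidean distance between two rows, base R's dist() may be able to replace the pairwise loop altogether. The sketch below is not the answer's method but a swapped-in alternative under that assumption; the helper name lag_pairs is hypothetical, and the data frame is rebuilt in the same shape as in the question (its values are not claimed to match the question's).

```r
set.seed(123)
df <- data.frame(site = rep(c("a", "b", "c"), each = 9),
                 year = rep(1990:1998, times = 3),
                 d1 = sample.int(50, 27),
                 d2 = sample.int(50, 27),
                 d3 = sample.int(50, 27))

# Hypothetical helper: all pairwise distances and time lags for one site
lag_pairs <- function(sub) {
  ed  <- as.matrix(dist(sub[c("d1", "d2", "d3")]))  # Euclidean by default
  lag <- as.matrix(dist(sub$year))                  # |year_j - year_k|
  ed[lower.tri(ed, diag = TRUE)]   <- NA            # drop duplicate pairs
  lag[lower.tri(lag, diag = TRUE)] <- NA
  data.frame(y = as.vector(ed), x = as.vector(lag))
}

# One data frame of (y, x) pairs per site
res <- lapply(split(df, df$site), lag_pairs)
str(res$a)
```

split() plus lapply() keeps the per-site results in a named list; do.call(rbind, res) would stack them if a single data frame is preferred.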
Create many new empty columns in an existing R data frame from a list of column names [duplicate]
I have the following data frame and vector:
ddf = data.frame(a=rep(1,10), b=rep(2,10))
xx = c("c", "d", "e", "f")
How can I add new empty columns named after the items in xx? I tried the following, but it does not work:
ddf = cbind(ddf, data.frame(xx))
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 10, 4
The following also does not work:
for(i in 1:length(xx)){
  ddf$(xx[i]) = ""
}
Error: unexpected '(' in:
"for(i in 1:length(xx)){
  ddf$("
}
Error: unexpected '}' in "}"
This will get you there:
ddf[xx] <- NA
#   a b  c  d  e  f
#1  1 2 NA NA NA NA
#2  1 2 NA NA NA NA
#3  1 2 NA NA NA NA
#...
You can't directly use something like ddf$xx because this will try to assign to a column called xx rather than interpreting xx. You need to use the [ and [<- functions, that is, the square brackets, when you are dealing with a character string/vector, like ddf["columnname"] or ddf[c("col1","col2")], or a stored vector like your ddf[xx].
The reason why it selects columns is that data.frames are essentially lists:
is.list(ddf)
#[1] TRUE
as.list(ddf)
#$a
# [1] 1 1 1 1 1 1 1 1 1 1
#
#$b
# [1] 2 2 2 2 2 2 2 2 2 2
...with each column corresponding to a list entry. So if you don't use a comma to specify a row, like ddf["name",], or a column, like ddf[,"name"], you get the column by default.
If you are working with a 0-row dataset, you cannot use a value like NA as the replacement. Instead, replace with list(character(0)), where character(0) can be substituted with numeric(0), integer(0), logical(0) etc., depending on the class you want for your new columns.
ddf <- data.frame(a=character())
xx <- c("c", "d", "e", "f")
ddf[xx] <- list(character(0))
ddf
#[1] a c d e f
#<0 rows> (or 0-length row.names)
This seems to succeed:
> cbind(ddf, setNames( lapply(xx, function(x) x=NA), xx) )
   a b  c  d  e  f
1  1 2 NA NA NA NA
2  1 2 NA NA NA NA
3  1 2 NA NA NA NA
4  1 2 NA NA NA NA
5  1 2 NA NA NA NA
6  1 2 NA NA NA NA
7  1 2 NA NA NA NA
8  1 2 NA NA NA NA
9  1 2 NA NA NA NA
10 1 2 NA NA NA NA
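If the loop form from the question is preferred, the double-bracket operator accepts a column name held in a variable, which is what ddf$(xx[i]) was trying to express. A minimal sketch, assuming the new columns should be character columns filled with NA (the choice of NA_character_ is an assumption; use NA_real_ or NA for other types):

```r
ddf <- data.frame(a = rep(1, 10), b = rep(2, 10))
xx <- c("c", "d", "e", "f")

for (nm in xx) {
  # ddf[[nm]] creates (or overwrites) the column whose name is stored in nm
  ddf[[nm]] <- NA_character_
}
str(ddf)
```

For a fixed vector of names, the one-line ddf[xx] <- NA shown earlier does the same job without a loop.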
Merge two columns maintaining missing values
I am trying to add two columns. My data frame looks like this:
data <- data.frame(a = c(0,1,NA,0,NA,NA),
                   x = c(NA,NA,NA,NA,1,0),
                   t = c(NA,2,NA,NA,2,0))
I want to add some of the columns like this:
yep <- cbind.data.frame( data$a, data$x, rowSums(data[,c(1, 2)], na.rm = TRUE))
However, the output looks like this:
> yep
  data$a data$x rowSums(data[,c(1, 2)], na.rm = TRUE)
1      0     NA                                     0
2      1     NA                                     1
3     NA     NA                                     0
4      0     NA                                     0
5     NA      1                                     1
6     NA      0                                     0
And I would like an output like this:
> yep
  data$a data$x rowSums(data[,c(1, 2)], na.rm = TRUE)
1      0     NA                                     0
2      1     NA                                     1
3     NA     NA                                    NA
4      0     NA                                     0
5     NA      1                                     1
6     NA      0                                     0
If both columns contain only NA values in a row, I want the result to stay NA. How could I achieve this?
Base R:
data <- data.frame("a" = c(0,1,NA,0,NA,NA),
                   "x" = c(NA,NA,NA,NA,1,0),
                   "t" = c(NA,2,NA,NA,2,0)
)
yep <- cbind.data.frame( data$a, data$x, rs = rowSums(data[,c(1, 2)], na.rm = TRUE))
yep$rs[is.na(data$a) & is.na(data$x)] <- NA
yep
Base R (ifelse):
cbind(data$a,data$x,ifelse(is.na(data$a) & is.na(data$x),NA,rowSums(data[,1:2],na.rm = TRUE)))
If you want the column names, replace cbind with cbind.data.frame.
Output:
     [,1] [,2] [,3]
[1,]    0   NA    0
[2,]    1   NA    1
[3,]   NA   NA   NA
[4,]    0   NA    0
[5,]   NA    1    1
[6,]   NA    0    0
You might try dplyr::coalesce:
cbind.data.frame( data$a, data$x, dplyr::coalesce(data$a, data$x))
#  data$a data$x dplyr::coalesce(data$a, data$x)
#1      0     NA                               0
#2      1     NA                               1
#3     NA     NA                              NA
#4      0     NA                               0
#5     NA      1                               1
#6     NA      0                               0
Base R with nested ifelse:
data[['rowsum']]<-ifelse(is.na(data$a) & is.na(data$x),NA,ifelse(is.na(data$a),0,data$a)+ifelse(is.na(data$x),0,data$x))
    a  x  t rowsum
1:  0 NA NA      0
2:  1 NA  2      1
3: NA NA NA     NA
4:  0 NA NA      0
5: NA  1  2      1
6: NA  0  0      0
Another base R approach. If all the values in a row are NA, return NA; otherwise return the sum of the row, ignoring NAs.
#Select only the columns which we need
sub_df <- data[c("a", "x")]
sub_df$answer <- ifelse(rowSums(is.na(sub_df)) == ncol(sub_df), NA, rowSums(sub_df, na.rm = TRUE))
sub_df
#   a  x answer
#1  0 NA      0
#2  1 NA      1
#3 NA NA     NA
#4  0 NA      0
#5 NA  1      1
#6 NA  0      0
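The same rule (sum the non-missing values, but keep NA when every value in the row is missing) can be wrapped in a small reusable function. This is a sketch only; the function name row_sum_keep_na and the new column name ax are hypothetical.

```r
# Hypothetical helper: row sums that stay NA when a whole row is NA
row_sum_keep_na <- function(d) {
  all_missing <- rowSums(!is.na(d)) == 0   # TRUE where a row has no values
  out <- rowSums(d, na.rm = TRUE)
  out[all_missing] <- NA
  unname(out)
}

data <- data.frame(a = c(0, 1, NA, 0, NA, NA),
                   x = c(NA, NA, NA, NA, 1, 0),
                   t = c(NA, 2, NA, NA, 2, 0))

data$ax <- row_sum_keep_na(data[c("a", "x")])
data
```

Because the helper takes any set of columns, it also works for more than two, for example row_sum_keep_na(data[c("a", "x", "t")]).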
Subtracting Columns Except When There is NA
I am trying to create a new variable that subtracts two columns only when neither column is NA, and is NA whenever one of the columns is NA. When I just subtract the columns, I only get a column of NA. For instance, I am writing the command:
d$x3 <- d$x2 - d$x1
When I use the command above, I get:
x1 x2 x3
 1  3 NA
 1 NA NA
NA  3 NA
NA NA NA
Based on some other posts online, I tried a workaround where I changed x1 to negative numbers and then used the rowSums command, but then I got this:
x3 <- rowSums(df[,c("x1","x2")], na.rm = TRUE)
x1 x2 x3
-1  3  2
-1 NA -1
NA  3  3
NA NA  0
What I am trying to produce is:
x1 x2 x3
 1  3  2
 1 NA NA
NA  3 NA
NA NA NA
Thanks for any help!
df <- read.table(
  text="x1 x2
1 3
1 NA
NA 3
NA NA", header=T)
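With numeric columns, plain subtraction already behaves this way, because arithmetic in R returns NA whenever either operand is NA. The sketch below uses the data from the read.table() call above. If the real data gives NA in every row, one possible cause (an assumption, since the real data is not shown) is that the columns are stored as factors rather than numbers; str(d) would show that.

```r
df <- read.table(text = "x1 x2
1 3
1 NA
NA 3
NA NA", header = TRUE)

# NA in either column propagates to the result
df$x3 <- df$x2 - df$x1
df
```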
Randomly delete up to 3 elements per row
I would like to randomly delete up to three elements per row of a data set containing five columns. Below is R code I thought would do it, but it allows up to all five elements in a row to be deleted. This seems basic, but I cannot find the error. Thank you for any advice.
set.seed(1234)
# create matrix to contain flags identifying elements to be deleted
delete.these <- matrix(0, nrow=10, ncol=5)
for(i in 1:nrow(delete.these)) {
  # for each row randomly select the order of the columns
  # to be tested for deletion
  rcols <- sample(5, 5, replace = FALSE)
  for(j in 1:ncol(delete.these)) {
    # select a random draw
    delete.it <- runif(1,0,1)
    # if random draw is below specified threshold and fewer than three
    # elements have already been deleted from the row then delete element
    if((delete.it <= 0.7) & sum(delete.these[i,1:5] <= 2)) { delete.these[i,rcols[j]] = 1}
    if((delete.it > 0.7) | sum(delete.these[i,1:5] >= 3)) { delete.these[i,rcols[j]] = 0}
  }
}
delete.these
Instead of using runif(), try drawing the indices directly:
delete.these <- matrix(0, nrow=10, ncol=5)
for (i in 1:NROW(delete.these)){
  delete.these[i,sample.int(5,sample.int(4,1)-1)] <- 1
}
delete.these
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    1    1    0    0
 [2,]    0    0    0    0    0
 [3,]    0    1    0    1    1
 [4,]    0    1    1    0    1
 [5,]    1    0    1    0    0
 [6,]    0    0    0    0    0
 [7,]    1    0    1    0    0
 [8,]    0    1    0    1    1
 [9,]    0    1    1    0    0
[10,]    1    0    1    0    1
By the way, your code doesn't work because of a misplaced parenthesis. sum(delete.these[i,1:5] <= 2) should instead be sum(delete.these[i,1:5]) <= 2
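To see why the position of the parenthesis matters, compare the two expressions on a single row. The vector flags below is a made-up example of a row in which four elements have already been flagged.

```r
flags <- c(1, 1, 1, 1, 0)   # made-up row: four elements already flagged

# Counts how many entries are <= 2. Every entry is 0 or 1, so this is 5,
# and `&` treats any non-zero number as TRUE.
sum(flags <= 2)

# Compares the number of flagged elements (4) with 2, as intended.
sum(flags) <= 2
```

With the original placement the condition can never be FALSE, so the limit on deletions per row is never enforced.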
It would be easier (and much faster) to delete with a two-column matrix as an argument to [<-. You did not propose a test case, but I will:
dfrm <- data.frame(a1=rnorm(20), a2=rnorm(20),a3=rnorm(20), a4=rnorm(20),a5=rnorm(20))
dfrm[ matrix( c( rep(1:20,each=3), replicate(20, {sample(5, 3)} ) ), ncol=2) ] <- NA
> dfrm
           a1          a2            a3          a4         a5
1          NA  0.70871541            NA          NA -0.6922827
2   1.9846227  1.70592512            NA          NA         NA
3   0.2684487          NA  0.0008968694          NA         NA
4          NA          NA  0.5546355410  0.07399188         NA
5          NA  0.82324761 -0.0410918599          NA         NA
6          NA          NA -1.0715205164          NA -0.1683819
7   0.0933059          NA            NA          NA  1.3129301
8          NA  0.79382695  0.1877369725          NA         NA
9   0.3124101          NA            NA -1.22087347         NA
10 -0.1657043          NA            NA  1.36626832         NA
11         NA -0.06095247 -0.9622792102          NA         NA
12         NA -1.29243386 -1.2133819819          NA         NA
13 -0.0886702          NA            NA  0.37495775         NA
14  1.0812527 -1.54215156            NA          NA         NA
15         NA -0.24765627            NA  0.81374405         NA
16         NA  0.21307051            NA          NA -0.6825013
17 -0.4129100          NA            NA          NA -0.9844177
18         NA  1.95881167  0.7977172969          NA         NA
19         NA          NA  0.0953287645          NA  1.7067591
20         NA          NA -0.1057690912  0.73408897         NA
This assumes that by "delete" you meant "set to missing". If the intent was something else, you will need to supply a test case and clarify.
This nested sampling strategy will provide a variable number of rows in the indexing matrix per row of the target matrix:
idx <- lapply(1:20, function(x) {
  n <- sample(1:5, sample(1:3,1))
  matrix( c(rep(x,length(n)), n), ncol=2)
}) # list of 2-column matrices
idx <- do.call(rbind, idx) # now a 2 col matrix
dfrm[ idx] <- NA
> dfrm
             a1         a2         a3          a4         a5
1  -0.048776740         NA  1.1879195 -0.23142932 -3.6185891
2            NA  0.4613289 -0.4532400 -0.85891682 -2.2034714
3            NA         NA  1.1191833  1.12545821         NA
4   0.646399767 -0.7126735  2.9474470  0.36358070         NA
5  -0.630929314  1.3770828         NA          NA  1.3987857
6            NA         NA         NA  1.06680025  0.4445383
7   0.484728630         NA  0.7382064          NA  0.9838159
8  -1.558031074  1.1630888         NA          NA         NA
9  -0.968887379 -0.7330051         NA  0.04621124 -0.9785049
10  0.935436533         NA         NA -1.07365274         NA
11           NA  0.2529093         NA -1.38643245 -1.3389529
12           NA -0.2639166 -0.2301257          NA         NA
13  2.026646586 -0.2452684         NA -0.30346521         NA
14  0.522717033         NA         NA  1.25870278         NA
15           NA         NA -0.9934046 -0.89009964 -0.8403772
16           NA         NA  0.0987765 -0.98608109  1.4646301
17           NA  0.7693064 -0.9326388 -0.16240266         NA
18 -0.005393965         NA         NA          NA -0.8111057
19           NA  1.6241122 -1.1376916  0.15812435         NA
20           NA         NA         NA  0.71059666  0.5170046