How to create a new column with an if condition in R

This seems simple, but I could not get it to work. It is different from the similar-sounding questions asked here.
I want to create new columns, say df$col1, df$col2, and df$col3, on a dataframe df using conditions on the columns that already exist, i.e. df$con and df$val.
I would like to write the value of column "val" into df$col1 if df$con > 3,
write the value of df$val into df$col2 if df$con < 2,
and write 30% of df$val into df$col3 if df$con is between 1 and 3.
How should I do it? Below is my dataframe df with two columns: "con" for the condition and "val" for the value to use.
dput(df)
structure(list(con = c(-33.09524956, -36.120924, -28.7020053,
-26.06385399, -18.45731163, -14.51817928, -20.1005132, -23.62346403,
-24.90464018, -23.51471516), val = c(0.016808197, 1.821442227,
4.078385886, 3.763593573, 2.617612605, 2.691796601, 1.060565469,
0.416400183, 0.348732675, 1.185505136)), .Names = c("con", "val"
), row.names = c(NA, 10L), class = "data.frame")

This might do it. First we write a function that changes FALSE values to NA:
foo <- function(x) {
  is.na(x) <- x == FALSE  # turn FALSE into NA; TRUE stays TRUE
  return(x)
}
Then apply it over the list of logical vectors and take the matching val column values:
df[paste0("col", 1:3)] <- with(df, {
  x <- list(con > 3, con < 2, con < 3 & con > 1)  # one logical vector per new column
  lapply(x, function(y) val[foo(y)])  # TRUE picks up val, NA stays NA
})
resulting in
df
         con       val col1      col2 col3
1  -33.09525 0.0168082   NA 0.0168082   NA
2  -36.12092 1.8214422   NA 1.8214422   NA
3  -28.70201 4.0783859   NA 4.0783859   NA
4  -26.06385 3.7635936   NA 3.7635936   NA
5  -18.45731 2.6176126   NA 2.6176126   NA
6  -14.51818 2.6917966   NA 2.6917966   NA
7  -20.10051 1.0605655   NA 1.0605655   NA
8  -23.62346 0.4164002   NA 0.4164002   NA
9  -24.90464 0.3487327   NA 0.3487327   NA
10 -23.51472 1.1855051   NA 1.1855051   NA
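Why this works: indexing a vector with a logical vector that contains NA returns NA at those positions, so converting FALSE to NA before subsetting yields val where the condition holds and NA elsewhere. A quick illustration with hypothetical values:
c(10, 20, 30)[c(TRUE, NA, TRUE)]
# [1] 10 NA 30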

You could go the tidyverse way. The pipes %>% just send the output of each operation on to the next function. mutate allows you to make a new column in your data frame, but you have to remember to assign the result; here it is stored as output. ifelse allows you to conditionally assign values to your new column, for example col1: the second argument of ifelse is the value used when the condition is TRUE, and the third argument is the value used when it is FALSE. Hope this helps some too!
Go tidyverse!
library(tidyverse)
output <- df %>%
  mutate(col1 = ifelse(con > 3, val, NA)) %>%
  mutate(col2 = ifelse(con < 2, val, NA)) %>%
  mutate(col3 = ifelse(con <= 3 & con >= 1, 0.3 * val, NA))
Here's a df that actually meets some of the conditions:
structure(list(con = c(-33.09524956, 2.5, -28.7020053, 2, -18.45731163,
2, -20.1005132, 6, -24.90464018, -23.51471516), val = c(0.016808197,
1.821442227, 4.078385886, 3.763593573, 2.617612605, 2.691796601,
1.060565469, 0.416400183, 0.348732675, 1.185505136)), .Names = c("con",
"val"), row.names = c(NA, 10L), class = "data.frame")
Here's the output after running the code:
         con       val      col1      col2      col3
1  -33.09525 0.0168082        NA 0.0168082        NA
2    2.50000 1.8214422        NA        NA 0.5464327
3  -28.70201 4.0783859        NA 4.0783859        NA
4    2.00000 3.7635936        NA        NA 1.1290781
5  -18.45731 2.6176126        NA 2.6176126        NA
6    2.00000 2.6917966        NA        NA 0.8075390
7  -20.10051 1.0605655        NA 1.0605655        NA
8    6.00000 0.4164002 0.4164002        NA        NA
9  -24.90464 0.3487327        NA 0.3487327        NA
10 -23.51472 1.1855051        NA 1.1855051        NA
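Since each new column depends on a single condition, dplyr::case_when is a tidy alternative that returns NA automatically when no condition matches. A minimal sketch of the same logic (my variant, not from the original answer; same df assumed):
library(dplyr)
# case_when yields NA wherever the left-hand condition is FALSE or NA
output <- df %>%
  mutate(col1 = case_when(con > 3 ~ val),
         col2 = case_when(con < 2 ~ val),
         col3 = case_when(con >= 1 & con <= 3 ~ 0.3 * val))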

Related

How do I remove letters from numeric cells so that I can make the column entirely numeric? (R)

I've got a dataframe with a column full of pixel coordinates. I want to remove the 'px' from these values so that I can make the entire column numeric without introducing NAs.
> print(data_exp_59965_v11_task_j84b$`X Coordinate`)
[1] NA NA NA NA NA NA NA NA NA
[10] NA NA NA NA NA NA NA NA NA
[19] NA NA "-401.222px" "401.222px" "-200.611px" "347.458px" "200.611px" "347.458px" "-200.611px"
[28] "-347.458px" "200.611px" "-347.458px" NA
library(tidyverse)
data_exp_59965_v11_task_j84b %>%
  mutate(`X Coordinate` = as.numeric(str_replace_all(`X Coordinate`, "px$", "")))
Output
X Coordinate
1 -401.222
2 401.222
3 -200.611
4 NA
5 NA
Data
data_exp_59965_v11_task_j84b <- structure(list(`X Coordinate` = c("-401.222px", "401.222px",
"-200.611px", NA, NA)), class = "data.frame", row.names = c(NA,
-5L))
You could use sub (note: with fixed = TRUE the $ would be matched literally rather than as an end-of-string anchor, so it must be omitted here):
df$`X Coordinate` <- as.numeric(sub("px$", "", df$`X Coordinate`))
More generally, you might try:
df$`X Coordinate` <- as.numeric(sub(".*?(-?\\d+(?:\\.\\d+)?).*", "\\1", df$`X Coordinate`, perl = TRUE))
This captures the first number in each string and discards any surrounding text (the lazy .*? and the non-capturing group require perl = TRUE).
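For instance, on a few hypothetical inputs the pattern pulls out just the first number:
x <- c("-401.222px", "offset: 12.5 px", "42")
as.numeric(sub(".*?(-?\\d+(?:\\.\\d+)?).*", "\\1", x, perl = TRUE))
# [1] -401.222   12.500   42.000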
This is a perfect use case for parse_number from readr, which is part of the tidyverse:
Data from #AndrewGB (many thanks)
library(dplyr)
library(readr)
data_exp_59965_v11_task_j84b %>%
  mutate(`X Coordinate` = parse_number(`X Coordinate`))
X Coordinate
1 -401.222
2 401.222
3 -200.611
4 NA
5 NA

Select columns, excluding some which are all NA

Suppose I have this dataframe
df <- data.frame(keep = c(1, NA, 2),
                 also_want = c(NA, NA, NA),
                 maybe = c(1, 2, NA),
                 maybe_2 = c(NA, NA, NA))
Edit: In the actual dataframe there are many columns I'd like to keep, so spelling them all out isn't viable. These are all the columns that do not start with maybe. The maybe columns, in contrast, share a common naming scheme (maybe, maybe_1, etc.) that could work with grep or stringr::str_detect.
I want to select keep and also_want. I also want any of the maybe columns that have values other than NA.
desired_df
  keep also_want maybe
1    1        NA     1
2   NA        NA     2
3    2        NA    NA
I can use select_if to get all columns that have non-NA values, but then I lose also_want
library(dplyr)
df %>%
  select_if(~sum(!is.na(.)) > 0)
  keep maybe
1    1     1
2   NA     2
3    2    NA
Thoughts?
With dplyr 1.0.0 you can use the where function inside a select statement to test conditions that your variables have to satisfy; before that test, you specify the variables you want to keep regardless.
EDIT
I've added the condition that only the "maybe" variables have to contain values other than NA; before it, we select every column that does not start with "maybe".
df %>%
  select(!starts_with("maybe"), starts_with("maybe") & where(~sum(!is.na(.)) > 0))
Output
#   keep also_want maybe
# 1    1        NA     1
# 2   NA        NA     2
# 3    2        NA    NA
Following your comments, in base R we can use
df[, !apply(
  rbind(
    grepl("maybe", colnames(df)),              # row 1: column name contains "maybe"
    !apply(df, 2, function(x) !all(is.na(x)))  # row 2: column is entirely NA
  ),
  2, all)]  # drop columns for which both are TRUE
  keep also_want maybe
1    1        NA     1
2   NA        NA     2
3    2        NA    NA
Or if you prefer seeing the same code all on 1 line:
df[,!apply(rbind(grepl("maybe",colnames(df)),!apply(df, 2, function(x) !all(is.na(x)))),2,all)]
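A more compact base R route (a sketch under the same assumption that the disposable columns share the "maybe" name prefix) combines the two tests with vectorized functions instead of nested apply calls:
# drop columns that start with "maybe" and contain no non-NA values
df[, !(grepl("^maybe", names(df)) & colSums(!is.na(df)) == 0), drop = FALSE]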
I eventually figured this out. Using str_detect to select all non-maybe columns, and then using a one-liner inside sapply to also select any other columns (i.e. any maybe columns) that have non-NA values.
library(dplyr)
library(stringr)
df %>%
  select_if(stringr::str_detect(names(.), "maybe", negate = TRUE) |
              sapply(., function(x) sum(!is.na(x)) > 0))
  keep also_want maybe
1    1        NA     1
2   NA        NA     2
3    2        NA    NA

Why does R 'sample' some columns more than others?

I am testing the impact of missing data on regression analysis. So, using a simulated dataset, I want to randomly remove a proportion of observations (not entire rows) from a designated set of columns. I am using 'sample' to do this. Unfortunately, this is making some columns have much more missing values than others. See an example below:
# Data frame with 5 columns, 10 rows
DF = data.frame(A = paste(letters[1:10]), B = rnorm(10, 1, 10), C = rnorm(10, 1, 10),
                D = rnorm(10, 1, 10), E = rnorm(10, 1, 10))
# Function to randomly delete a proportion (ProportionRemove) of records per column,
# for a designated set of columns (ColumnStart - ColumnEnd)
RandomSample = function(DataFrame, ColumnStart, ColumnEnd, ProportionRemove){
  # ci is the opposite of the proportion
  ci = 1 - ProportionRemove
  Missing = sapply(DataFrame[(ColumnStart:ColumnEnd)],
                   function(x) x[sample(c(TRUE, NA), prob = c(ci, ProportionRemove),
                                        size = length(DataFrame), replace = TRUE)])
}
# Randomly sample columns 2 - 5 within DF, deleting 80% of the observations per column
Test = RandomSample(DF, 2, 5, 0.8)
I understand there is an element of randomness to this, but in 10 trials (10*4 = 40 columns), 17 of the columns had no data, and in one trial, one column still had 6 records (rather than the expected ~2) - see below.
       B         C         D  E
 [1,] NA 24.004402  7.201558 NA
 [2,] NA        NA        NA NA
 [3,] NA  4.029659        NA NA
 [4,] NA        NA        NA NA
 [5,] NA 29.377632        NA NA
 [6,] NA  3.340918 -2.131747 NA
 [7,] NA        NA        NA NA
 [8,] NA 15.967318        NA NA
 [9,] NA        NA        NA NA
[10,] NA -8.078221        NA NA
In summary, I want to replace a proportion of observations with NAs in each column.
Any help is greatly appreciated!!!
This makes sense to me. As #Frank suggested (in a since-deleted comment ... *sigh*), "randomness" can give you really non-random-looking results (Dilbert: Tour of Accounting, 2001-10-25).
If you want random samples with guaranteed ratios, try this:
guaranteedSampling <- function(DataFrame, ProportionRemove) {
  # number of cells to blank out per column (at least 1)
  n <- max(1L, floor(nrow(DataFrame) * ProportionRemove))
  # draw n distinct row positions independently for each column
  inds <- replicate(ncol(DataFrame), sample(nrow(DataFrame), size = n), simplify = FALSE)
  # set NA at the sampled positions, column by column
  DataFrame[] <- mapply(`[<-`, DataFrame, inds, MoreArgs = list(NA), SIMPLIFY = FALSE)
  DataFrame
}
set.seed(2)
guaranteedSampling(DF[2:5], 0.8)
#           B          C         D        E
# 1        NA         NA        NA       NA
# 2        NA         NA        NA       NA
# 3        NA         NA        NA       NA
# 4  6.792463  10.582938        NA       NA
# 5        NA         NA -0.612816       NA
# 6        NA  -2.278758        NA       NA
# 7        NA         NA        NA 2.245884
# 8        NA         NA        NA 5.993387
# 9  7.863310         NA  9.042127       NA
# 10       NA         NA        NA       NA
Further to #joran's comment, you wanted either nrow(DataFrame) or length(x).
The specific impact in your example is that you are producing a vector with 5 elements (because DF has 5 variables), each with probability 0.8 of being NA and 0.2 of being TRUE.
Then this statement (which is what the sapply does to each column you specify; here I apply it to DF$B only):
DF$B[sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)]
does something that isn't immediately obvious to the uninitiated*. This:
sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)
gives a logical vector which, when used to extract elements of a longer vector, is silently recycled. So let's say you end up with:
NA TRUE NA TRUE NA
When you subset DF$B you end up getting this:
DF$B[c(NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, TRUE, NA)]
Notice in your example how the top 5 numbers always follow the same pattern as the bottom 5 numbers. This also explains why so many columns ended up being all NA: there is a 0.8^5 = 0.32768 probability of getting 5 NAs out of 5, and that pattern gets recycled over the whole column.
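You can see the recycling directly with a toy vector:
# a length-5 logical index is silently recycled over a length-10 vector
(1:10)[c(TRUE, NA, TRUE, NA, NA)]
# [1]  1 NA  3 NA NA  6 NA  8 NA NA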
The other issue with your code is that assigning to Missing inside the function has no effect outside it; the function only happens to return that value (invisibly) because its last statement is an assignment. Here it is corrected and cleaned up, following http://adv-r.had.co.nz/Style.html:
random_sample <- function(x, col_start, col_end, p) {
  sapply(x[col_start:col_end],
         function(y) y[sample(c(TRUE, NA), prob = c(1 - p, p),
                              size = length(y), replace = TRUE)])
}
*The uninitiated in this case includes me! I had no idea that logical vectors were recycled when used to extract until having a look at this question.
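A usage sketch of the corrected function (results vary with the seed; note sapply returns a matrix here):
set.seed(42)
random_sample(DF, 2, 5, 0.8)  # matrix of columns B-E with ~80% NAs per column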

Passing positive results from multiple columns into a single new column in R

I am trying to work out a way to create a single column from multiple columns in R. What I want to do is for R to go through all rows for multiple columns and if it finds a positive result in one of those columns, to pass that result into an 'amalgam' column (sorry I don't know a better word for it).
See the toy dataset below
x <- c(NA, NA, NA, NA, NA, 1)
y <- c(NA, NA, 1, NA, NA, NA)
z <- c(NA, 1, NA, NA, NA, NA)
df <- data.frame(cbind(x, y, z))
df[, "compCol"] <- NA
df
   x  y  z compCol
1 NA NA NA      NA
2 NA NA  1      NA
3 NA  1 NA      NA
4 NA NA NA      NA
5 NA NA NA      NA
6  1 NA NA      NA
I need to record, for each row, the column position of the positive result in the compCol column, while rows with no positive result get 0, so that it looks like this:
   x  y  z compCol
1 NA NA NA       0
2 NA NA  1       3
3 NA  1 NA       2
4 NA NA NA       0
5 NA NA NA       0
6  1 NA NA       1
I know it probably requires an if else statement nested inside a for loop, but all the ways I have tried result in errors that I don't understand.
I tried the following just for a single column
for (i in 1:length(x)) {
  if (df$x[i] == 1) {
    df$compCol[i] <- df$x[i]
  }
}
But it didn't work at all.
I got the message 'Error in if (df$x[i] == 1) { : missing value where TRUE/FALSE needed'
And that makes sense but I can't see where to put the TRUE/FALSE statement
You can also use reshaping with NA removal
library(dplyr)
library(tidyr)
df.id = df %>% mutate(ID = 1:n())
df.id %>%
  gather(variable, value,
         x, y, z,
         na.rm = TRUE) %>%
  left_join(df.id)
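In current tidyr, gather is superseded by pivot_longer; a rough equivalent of the same reshaping step (a sketch, assuming the df.id built above):
library(dplyr)
library(tidyr)
df.id %>%
  pivot_longer(c(x, y, z), names_to = "variable", values_to = "value",
               values_drop_na = TRUE) %>%
  left_join(df.id, by = "ID")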
We can use max.col. Create a logical matrix ('ind') by checking whether the selected columns are greater than 0 and are not NA. max.col gives the column index of the maximum for each row; we multiply by rowSums of 'ind' so that a row with no TRUE values becomes 0.
ind <- df > 0 & !is.na(df)
df$compCol <- max.col(ind) * rowSums(ind)
df$compCol
#[1] 0 3 2 0 0 1
Or another option is pmax after multiplying by col(df):
do.call(pmax,col(df)*replace(df, is.na(df), 0))
#[1] 0 3 2 0 0 1
NOTE: I used the dataset before creating the 'compCol' in the OP's post.
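To unpack the pmax one-liner, here is the same computation split into steps (a sketch, likewise using df before compCol was added):
m <- replace(df, is.na(df), 0)  # NA -> 0, positive results stay 1
w <- col(df) * m                # each 1 becomes its column number
do.call(pmax, w)                # row-wise maximum across the columns
# [1] 0 3 2 0 0 1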

Merging multiple data frames without getting duplicates

I am trying to merge 6+ datasets into one by ID. Right now, the duplication of IDs makes merge treat each as a new observation.
Example code:
combined <-Reduce(function(x,y) merge(x,y, all=TRUE), list(NRa,NRb,NRc,NRd,NRe,NRf,NRg,NRh))
Which gives me this:
        ID Segment.h Segment.g Segment.f Segment.e Segment.d Segment.c
1 62729107        NA        NA        NA        NA        NA         1
2 62734839        NA         1        NA        NA         1        NA
3 62734839        NA        NA        NA         1        NA        NA
4 62737229        NA         1        NA        NA        NA        NA
5 62737229        NA        NA        NA         1         1        NA
I would like each ID to have a single row:
        ID Segment.h Segment.g Segment.f Segment.e Segment.d Segment.c
1 62729107        NA        NA        NA        NA        NA         1
2 62734839        NA         1        NA         1         1        NA
3 62737229        NA         1        NA         1         1        NA
Any help is appreciated. Thank you.
Using R's sqldf package will work, leaving you with one ID per row.
Data1 <- data.frame(
  X = sample(1:10),
  Housing = sample(c("yes", "no"), 10, replace = TRUE)
)
Data2 <- data.frame(
  X = sample(1:10),
  Credit = sample(c("yes", "no"), 10, replace = TRUE)
)
Data3 <- data.frame(
  X = sample(1:10),
  OwnsCar = sample(c("yes", "no"), 10, replace = TRUE)
)
Data4 <- data.frame(
  X = sample(1:10),
  CollegeGrad = sample(c("yes", "no"), 10, replace = TRUE)
)
library(sqldf)
sqldf("Select Data1.X, Data1.Housing, Data2.Credit, Data3.OwnsCar, Data4.CollegeGrad from Data1
       inner join Data2 on Data1.X = Data2.X
       inner join Data3 on Data1.X = Data3.X
       inner join Data4 on Data1.X = Data4.X
      ")
Why don't you try by = 'ID' in your merge() function? If that's not enough, try aggregate().
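For the aggregate() route, one possible collapse (a sketch, assuming at most one non-NA value per ID and column) is:
# collapse duplicate IDs, keeping the first non-NA value in each column
aggregate(. ~ ID, data = combined,
          FUN = function(x) x[!is.na(x)][1],
          na.action = na.pass)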
Your description of the problem is not entirely clear, and you don't provide data.
Assuming that all of your dataframes have the same dimensions, column names, column orders, and ID entries; that the ID row orders match; that ID is the first column; that all other entries are either NA or 1, and any cell holding a 1 in one dataframe is NA in all the others (or that sums of numeric values are acceptable); and that you want the result as a data frame ...
An Old-School solution using the abind package:
consolidate <- function(lst) {
  stopifnot(require(abind))
  ## form 3D array, replace NA
  x <- abind(lst, along = 3)
  x[is.na(x)] <- 0
  z <- x[,,1]  ## data store; also keeps the ID column from the first frame
  ## sum array along 3rd dimension
  for (j in seq(2, ncol(x)))
    for (i in seq(nrow(x)))
      z[i,j] <- sum(x[i,j,])
  z[z == 0] <- NA  ## restore NA
  as.data.frame(z)
}
For dataframes (with the above caveats) a,b,c:
consolidate(list(a,b,c))
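Under the same caveats, a shorter sketch without abind keeps the IDs aside and sums the remaining columns elementwise (a hypothetical helper, not from the original answer):
consolidate2 <- function(lst) {
  z <- lst[[1]]  # keep IDs from the first frame
  # sum the non-ID columns across frames, treating NA as 0
  s <- Reduce(`+`, lapply(lst, function(d) replace(d[-1], is.na(d[-1]), 0)))
  s[s == 0] <- NA  # restore NA where nothing was set
  z[-1] <- s
  z
}
consolidate2(list(a, b, c))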
