copy values from different columns based on conditions (r code)

I have data like one in the picture where there are two columns (Cday,Dday) with some missing values.
There can't be a row where there are values for both columns; there's a value on either one column or the other or in neither.
I want to create the column "new" that has copied values from whichever column there was a number.
Really appreciate any help!

Since no row has a value for both, you can just sum up the two existing columns. Assume your dataframe is called df.
df$'new' = rowSums(df[,2:3], na.rm=T)
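This sums each row while ignoring NAs, which gives you what you want for rows that have a value. Be aware that a row where both columns are NA comes out as 0 rather than NA, so reset those rows if that matters. (Note: you may need to adjust the column numbering if you have more columns than what you've shown.)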
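A minimal sketch of the rowSums approach, assuming a data frame with the same column names as in the question (the sample values are made up for illustration). It also puts NA back for rows where both columns are missing, since rowSums with na.rm = TRUE returns 0 there:

```r
# Illustrative data: at most one of Cday/Dday is filled per row
df <- data.frame(id   = 1:8,
                 Cday = c(1, 2, NA, NA, 3, NA, 2, NA),
                 Dday = c(NA, NA, NA, 3, NA, 2, NA, 1))

# Row sums ignoring NA; an all-NA row sums to 0
summed <- rowSums(df[, c("Cday", "Dday")], na.rm = TRUE)

# Restore NA where neither column had a value
both_missing <- is.na(df$Cday) & is.na(df$Dday)
df$new <- ifelse(both_missing, NA, summed)
```

Selecting the columns by name instead of by position (2:3) keeps the code working if more columns are added later.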

The dplyr package has the coalesce function.
library(dplyr)
df <- data.frame(id=1:8, Cday=c(1,2,NA,NA,3,NA,2,NA), Dday=c(NA,NA,NA,3,NA,2,NA,1))
new <- df %>% mutate(new = coalesce(Dday, Cday))  # coalesce() has no na.rm argument
new
# id Cday Dday new
#1 1 1 NA 1
#2 2 2 NA 2
#3 3 NA NA NA
#4 4 NA 3 3
#5 5 3 NA 3
#6 6 NA 2 2
#7 7 2 NA 2
#8 8 NA 1 1
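If you would rather not depend on dplyr, the same idea can be sketched in base R. The helper name coalesce_base is hypothetical (it is not part of any package); it takes the first non-NA value across its arguments, element by element:

```r
df <- data.frame(id   = 1:8,
                 Cday = c(1, 2, NA, NA, 3, NA, 2, NA),
                 Dday = c(NA, NA, NA, 3, NA, 2, NA, 1))

# Hypothetical helper: keep x where it is present, otherwise fall back to y,
# folding over any number of vectors from left to right
coalesce_base <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

df$new <- coalesce_base(df$Dday, df$Cday)
```

Unlike the rowSums approach, rows where both columns are NA stay NA without any extra step.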

Related

Accounting for NA using Pivot_longer in R

I'm trying to pivot_longer 34 columns of a data set with about 10,000 rows in R. The data was collected via survey, and each column represents a possible answer to a question. I want to pivot_longer one of the questions, which had 34 possible answers and accounts for 34 of the 107 columns. For each respondent, the column for the selected answer has the value 1, and the other 33 columns have NA.
Example subset of data frame for a question with 5 possible answers (df):
ID A B C D E
1 1 NA NA NA NA
2 NA 1 NA NA NA
3 NA NA NA NA 1
4 NA NA NA NA NA
5 NA 1 NA NA NA
I need to get to:
ID Answer
1 A
2 B
3 E
4 NA
5 B
I want to pivot_longer the results to this question, while maintaining all the other columns. The issue occurs because some people didn't answer this question, resulting in all NA's (See row 4).
I'm using the code:
dfNew <- pivot_longer(df, c(A,B,C,D,E), names_to = "Answer", values_drop_na = TRUE)
dfNew
ID Answer
1 A
2 B
3 E
5 B
Which removes ID 4 from the data. Not using values_drop_na results in having a row for every NA value in A:E. How do I get it to maintain ID 4 as part of the data set, and make the value for Answer NA?
You can use complete to fill the missing values :
library(tidyr)
pivot_longer(df, A:E, names_to = "Answer", values_drop_na = TRUE) %>%
complete(ID = unique(df$ID)) %>%
dplyr::select(-value)
# A tibble: 5 x 2
# ID Answer
# <int> <chr>
#1 1 A
#2 2 B
#3 3 E
#4 4 NA
#5 5 B
You can also use max.col here :
cbind(df[1], answer = names(df)[-1][max.col(!is.na(df[-1])) *
NA^ !rowSums(!is.na(df[-1]), na.rm = TRUE)])
This might be quite difficult to understand, so here is a breakdown:
1. max.col(!is.na(df[-1])) returns the column index of the non-NA value in each row; for a row that is all NA, it returns an arbitrary index.
2. NA^ !rowSums(!is.na(df[-1])) returns NA for rows that are all NA and 1 for rows that have at least one non-NA value.
3. Multiplying the results of steps 1 and 2 gives NA for the all-NA rows and the column index for rows that have a value.
max.col(!is.na(df[-1])) * NA^ !rowSums(!is.na(df[-1]), na.rm = TRUE)
#[1] 1 2 5 NA 2
4. We use the values above to subset the column names of df and get the answer.
names(df[-1])[max.col(!is.na(df[-1]))*NA^!rowSums(!is.na(df[-1]), na.rm = TRUE)]
#[1] "A" "B" "E" NA "B"
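The same logic can be written in a more explicit way, which may be easier to read than the NA^ trick. This is only a sketch on the 5-answer example from the question; the intermediate names answered and idx are made up for illustration:

```r
df <- data.frame(ID = 1:5,
                 A = c(1, NA, NA, NA, NA),
                 B = c(NA, 1, NA, NA, 1),
                 C = NA,
                 D = NA,
                 E = c(NA, NA, 1, NA, NA))

# Logical matrix: TRUE where an answer was selected
answered <- !is.na(df[-1])

# Column index of the selected answer in each row
idx <- max.col(answered, ties.method = "first")

# Rows with no answer at all get NA instead of an arbitrary index
idx[rowSums(answered) == 0] <- NA

result <- data.frame(ID = df$ID,
                     Answer = names(df)[-1][idx],
                     stringsAsFactors = FALSE)
```

Setting ties.method = "first" makes the result deterministic; the default breaks ties at random, which only matters here for the all-NA rows that are overwritten anyway.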

Shifting rows up in a particular column of data

I have a question about shifting of rows in the particular column of a data.
data <- data.frame(B=c(NA,NA,0,NA,NA,0),C=c(1,NA,NA,1,NA,NA))
B C
1 NA 1
2 NA NA
3 0 NA
4 NA 1
5 NA NA
6 0 NA
I tried the approach from the post "Shifting a column down by one":
na.omit(transform(data, B = c(NA, B[-nrow(data)])))
but only get
B C
4 0 1
expected output;
B C
1 0 1
2 0 1
How can we achieve that ?
Thanks.
If you want to remove all NA from each column and do not care that the rows will not match between columns you can do:
data <- data.frame(B=c(NA,NA,0,NA,NA,0),C=c(1,NA,NA,1,NA,NA))
res<-lapply(data,function(x){x[complete.cases(x)]})
res<-data.frame(res)
The second line says: for every column in data, keep only the values which are not NA.
Thanks to @thelatemail for the correction from the solution below, which worked, but would have kept the columns as factors:
data <- data.frame(B=c(NA,NA,0,NA,NA,0),C=c(1,NA,NA,1,NA,NA))
res<-apply(data,2,function(x){x[complete.cases(x)]})
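Note that data.frame(res) only works when every column ends up with the same number of non-NA values. A rough sketch of one way to handle unequal counts, assuming it is acceptable to pad the shorter columns with NA at the bottom (the names kept and padded are made up here):

```r
data <- data.frame(B = c(NA, NA, 0, NA, NA, 0),
                   C = c(1, NA, NA, 1, NA, NA))

# Drop the NAs from each column separately
kept <- lapply(data, function(x) x[!is.na(x)])

# Pad every column to the length of the longest one
n <- max(lengths(kept))
padded <- lapply(kept, function(x) c(x, rep(NA, n - length(x))))

res <- data.frame(padded)
```

With the example data both columns have two values, so no padding is actually added and res has two rows.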

R: how to join the duplicate rows in one dataframe

I have a dataframe with some duplicated rows, and I want to merge only those duplicated rows. An example is given below:
name b c d
1 yp 3 NA NA
2 yp 3 1 NA
3 IG NA 3 NA
4 OG 4 1 0
The duplicated rows are the rows which have the same name. Thus in this example, row 1 and row 2 need to be joined somehow, with the NA values replaced by the available numerical value.
name b c d
1 yp 3 1 NA
2 IG NA 3 NA
3 OG 4 1 0
Assumption: if two rows have the same name, and their corresponding columns are not NA, then the corresponding column values must be the same numerical value.
Here's a dplyr approach:
library(dplyr)
df %>% group_by(name) %>% summarise_each(funs(first(.[!is.na(.)])))
#Source: local data frame [3 x 4]
#
# name b c d
# (fctr) (int) (int) (int)
#1 IG NA 3 NA
#2 OG 4 1 0
#3 yp 3 1 NA
This groups the data by "name" and for each unique name, returns a single row and in each of the other columns returns the first value that is not NA or, NA if all entries are NAs. This is in line with the assumption that if several numerical values are present, they must all be the same (and hence, we can pick the first one).
Perhaps you can try something like the following:
library(data.table)
setDT(mydf)[, lapply(.SD, function(x) {
if (all(is.na(x))) NA else x[!is.na(x)][1]
}), by = name]
# name b c d
# 1: yp 3 1 NA
# 2: IG NA 3 NA
# 3: OG 4 1 0
Basically, if all values are NA, return NA; otherwise, take the first non-NA value.
As pointed out by @docendodiscimus, this can be simplified, because indexing an empty vector with [1] already returns NA:
setDT(mydf)[, lapply(.SD, function(x) x[!is.na(x)][1]), by = name]
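For completeness, the same "first non-NA value per group" rule can be sketched in base R with aggregate, with no extra packages. The data frame below is retyped from the question, and first_non_na is a made-up helper name:

```r
df <- data.frame(name = c("yp", "yp", "IG", "OG"),
                 b = c(3, 3, NA, 4),
                 c = c(NA, 1, 3, 1),
                 d = c(NA, NA, NA, 0))

# First non-NA value, or NA if the group has none
first_non_na <- function(x) x[!is.na(x)][1]

# na.action = na.pass keeps rows containing NA instead of dropping them
res <- aggregate(cbind(b, c, d) ~ name, data = df,
                 FUN = first_non_na, na.action = na.pass)
```

The na.action = na.pass part is the important bit: by default the formula interface of aggregate drops any row that contains an NA before grouping. The groups come back sorted by name.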
A quick way to solve this would be to use the dplyr package: group on the variables you want to join on, and then decide how to combine the rows within each group.
A good way to join the rows could be to take the mean of all but the NA values.
In your case the code would be:
library(dplyr)
df %>% group_by(name) %>%
summarise_each(funs(mean(., na.rm = TRUE)))  # a column that is all NA within a group gives NaN

add multiple columns to matrix based on value in existing column

I am looking for a way to add 3 values in 3 different columns to a matrix based on the value in an existing column.
experiment = rbind(1,1,1,2,2,2,3,3,3)
newColumns = matrix(NA,dim(experiment)[1],3) # make 3 columns of length experiment filled with NA
experiment = cbind(experiment,newColumns) # add new columns to the experimental data
experiment = data.frame(experiment)
experiment[experiment[,1]==1,2:4] = cbind(0,1,2) # add 3 columns at once
experiment$new[experiment[,1]==2] = 5 # add a single column
print(experiment)
X1 X2 X3 X4 new
1 1 0 0 0 NA
2 1 1 1 1 NA
3 1 2 2 2 NA
4 2 NA NA NA 5
5 2 NA NA NA 5
6 2 NA NA NA 5
7 3 NA NA NA NA
8 3 NA NA NA NA
9 3 NA NA NA NA
This, however, fills the new columns the wrong way. I want column 2 to be all 0's, column 3 to be all 1's and column 4 to be all 2's.
I know I can do it one column at a time, but my real dataset is quite large, so that isn't my preferred solution. I would like to be able to add more columns easily, just by widening the column range and adding to the 3 values in the example.
Instead of this:
experiment[experiment[,1]==1,2:4] = cbind(0,1,2) # add 3 columns at once
Try this:
experiment[experiment[,1] == 1, 2:4] <- rep(c(0:2), each=3)
The problem is that you've provided 3 values (0,1,2) to fill 9 entries. The values are by default filled column-wise. So, the first column is filled with 0, 1, 2 and then the values get recycled. So, it goes again 0,1,2 and 0,1,2. Since you want 0,0,0,1,1,1,2,2,2, you should explicitly generate using rep(0:2, each=3) (the each does the task of generating the data shown just above).
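To avoid hard-coding each = 3, the repeat count can be taken from the number of matching rows. A small sketch along those lines, rebuilding the example as a data frame directly (the names rows and values are made up for illustration):

```r
experiment <- data.frame(X1 = rep(1:3, each = 3),
                         X2 = NA, X3 = NA, X4 = NA)

rows   <- experiment$X1 == 1   # rows to fill
values <- c(0, 1, 2)           # one value per target column

# Column-wise fill: repeat each value once per selected row
experiment[rows, 2:4] <- rep(values, each = sum(rows))
```

To fill more columns, widen 2:4 and add one entry to values per extra column.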

How to match 1 column to 2 columns?

I'm trying to match numbers from one column to numbers in two other columns. I can do this just fine when matching to only a single column, but have problems extending to two columns. Here is what I am doing:
I have 2 dataframes, df1:
number value
1
2
3
4
5
and df2:
number_a number_b value
3 NA 3
NA 1 5
5 NA 1
NA 4 2
NA 2 4
What I want to do is match column "number" from df1 to EITHER "number_a" or number_b" in df2, then insert "value" from df2 into "value" of df1, to give the result df1 as:
number value
1 5
2 4
3 3
4 2
5 1
My approach is to use
df1$value <- df2$value[match(df1$number, df2$number_a)]
or
df1$value <- df2$value[match(df1$number, df2$number_b)]
which yields, respectively, for df1
number value
1 NA
2 NA
3 3
4 NA
5 1
and
number value
1 5
2 4
3 NA
4 2
5 NA
However, I can't seem to fill in all of the "value" column in df1 using this approach. How can I match "number" to "number_a" and "number_b" in one fell swoop. I tried
df1$value <- df2$value[match(df1$number, df2$number_a:number_b)]
but that didn't work.
Thanks!
Easier solution:
df2$number <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)
If you're not familiar with ifelse, it works with vectors in the form:
ifelse(Condition, ValueIfTrue, ValueIfFalse)
I am a newbie to R (coming from several years with C). Was trying out the suggestions and I thought I would paste what I came up with:
# Assuming either 'number_a' or 'number_b' is valid
# Combine into new column 'number' and delete the original columns
df2 <- transform(df2, number = ifelse(is.na(df2$number_a), df2$number_b,
df2$number_a))[-c(1:2)]
# Combine the two data frames by the column 'number'
# (assumes df1 holds only the 'number' column)
df <- merge(df1, df2, by = "number")
number value
1 5
2 4
3 3
4 2
5 1
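If you want to stay with the match approach from the question, the combined key can be used directly without merge. A minimal sketch, assuming the blanks in df2 are NA and that each row has exactly one of number_a or number_b filled in (the name key is made up here):

```r
df1 <- data.frame(number = 1:5)
df2 <- data.frame(number_a = c(3, NA, 5, NA, NA),
                  number_b = c(NA, 1, NA, 4, 2),
                  value    = c(3, 5, 1, 2, 4))

# One lookup key per row of df2: number_a if present, otherwise number_b
key <- ifelse(is.na(df2$number_a), df2$number_b, df2$number_a)

# Look up each df1 number in the combined key
df1$value <- df2$value[match(df1$number, key)]
```

This keeps df1 in its original row order, whereas merge sorts the result by the key column.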
