Merging multiple columns into one in a data frame by sum in R

I want to merge 7 columns into one column by sum, but I cannot find a good way to do this. The data frame contains 71 observations and 7 variables.
The first rows are:
> head(df)
  pop_exposed_1_1 pop_exposed_1_2 pop_exposed_1_3 pop_exposed_1_4
1              NA              NA        15778358              NA
2              NA              NA              NA              NA
3              NA              NA              NA         3971412
4              NA              NA              NA         2694625
5              NA              NA              NA              NA
6              NA              NA              NA              NA
  pop_exposed_2_2 pop_exposed_2_3 pop_exposed_2_4
1              NA              NA              NA
2        38044072              NA              NA
3              NA              NA              NA
4              NA              NA              NA
5              NA       1626335.0              NA
6              NA        429924.4              NA
All the NA values need to be replaced by a value from another variable, and some rows have multiple values that need to be combined by sum, so that the outcome is a single variable, pop_exposed. I have tried several things, but nothing has worked the way I would like.

Look up ?rowSums
rowSums(df, na.rm=TRUE)
rowMeans(df, na.rm=TRUE)
or the apply way
apply(df, 1, sum,  na.rm = TRUE)  # sum by row (MARGIN = 1); use 2 for columns
apply(df, 1, mean, na.rm = TRUE)  # mean by row (MARGIN = 1); use 2 for columns
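If the goal is to collapse the seven pop_exposed_* columns into a single pop_exposed variable, a minimal sketch (assuming the columns follow the pop_exposed_ naming pattern shown above) could look like this:
cols <- grep("^pop_exposed_", names(df), value = TRUE)
df$pop_exposed <- rowSums(df[, cols], na.rm = TRUE)   # row-wise sum, NAs ignored
# rowSums(..., na.rm = TRUE) returns 0 for rows that are entirely NA;
# reset those to NA if that is the preferred behaviour:
df$pop_exposed[rowSums(!is.na(df[, cols])) == 0] <- NA
Since each row appears to hold at most one non-missing value, the row sum effectively just picks that value.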

Related

Move data from small data frame to columns in large dataframe with R [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
I have two data frames in R. There is not an ID of any sort in DF1 to use to map the rows to - I just need the entire column copied over for a data migration.
DF1 has 1349 named columns and zero rows.
DF2 has 10 named columns and 2990 rows of sample data.
I made a small scale example:
DF1 <- data.frame(matrix(ncol = 10, nrow = 0))
colnames(DF1) <- c('one','two','three','four','five','six','seven','eight','nine','ten')
one <- c(1,54,7,3,6,3)
seven <- c('MLS','Marshall','AAE','JC','AAA','EXE')
DF2 <- data.frame(one,seven)
The column names are the same, but they are not blocked together in DF1 - they are randomly dispersed.
I want to find an efficient way of mapping the 10 columns and all of the rows from DF2 to DF1 without needing to type in each column name, as I will also need to do this with a much larger data frame later.
I expect the rest of the columns in DF1 to be blank/null apart from the 'imported' columns that have been added from DF2 -- this is okay. Is there an easy way to do this?
Thanks!
dplyr has a nice utility for this:
dplyr::bind_rows(DF1, DF2)
# one two three four five six seven eight nine ten
# 1 1 NA NA NA NA NA MLS NA NA NA
# 2 54 NA NA NA NA NA Marshall NA NA NA
# 3 7 NA NA NA NA NA AAE NA NA NA
# 4 3 NA NA NA NA NA JC NA NA NA
# 5 6 NA NA NA NA NA AAA NA NA NA
# 6 3 NA NA NA NA NA EXE NA NA NA
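For what it's worth, a base-R sketch of the same idea (assuming no packages are available) is to add the missing columns to DF2 by hand and then reorder them to DF1's layout:
missing <- setdiff(names(DF1), names(DF2))
DF2[missing] <- NA            # add the 8 columns DF2 lacks, filled with NA
result <- DF2[names(DF1)]     # reorder to DF1's column order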

Adapt layered R data frame so that values of variables match in rows (based on group and date)

I want to research group A's effect on B with respect to certain dependent variables I dubbed "target_n". Due to the way the data were generated, I have "layers" of information in my dataset that are ordered by group. That means that in rows where Group=="B" I have B's values on "target_n", and in rows where Group=="A" I have A's values on "X_n". Group "C" is basically an "other" category, but I need it in the same rows as A and B as well, to make sure that A's effects are on B and not on C. The following should add some clarity:
My data (df) are structured like this:
df <- data.frame(
  "Date"  = c("1990-03","2000-01","2010-09","1990-03","2000-01","2010-09","1990-03","2000-01","2010-09"),
  "Group" = c("A","A","A","B","B","B","C","C","C"),
  "X_1_A" = c(9,4,7,NA,NA,NA,NA,NA,NA),
  "X_2_A" = c(1,2,6,NA,NA,NA,NA,NA,NA),
  "target_1_B" = c(NA,NA,NA,0,2,9,NA,NA,NA),
  "target_2_B" = c(NA,NA,NA,9,2,1,NA,NA,NA),
  "target_1_C" = c(NA,NA,NA,NA,NA,NA,5,3,1),
  "target_2_C" = c(NA,NA,NA,NA,NA,NA,1,9,2)
)
What I want is to compute new variables both for group "A" and group "C" so that everything falls within the same rows. If I were to do that manually, I would take A's "X_1" value at date "1990-03" and assign it to B's row for the same date, in a new column.
So in the end, my data would look like this:
df <- data.frame(
  "Date"  = c("1990-03","2000-01","2010-09","1990-03","2000-01","2010-09","1990-03","2000-01","2010-09"),
  "Group" = c("A","A","A","B","B","B","C","C","C"),
  "X_1_A" = c(9,4,7,NA,NA,NA,NA,NA,NA),
  "X_2_A" = c(1,2,6,NA,NA,NA,NA,NA,NA),
  "target_1_B" = c(NA,NA,NA,0,2,9,NA,NA,NA),
  "target_2_B" = c(NA,NA,NA,9,2,1,NA,NA,NA),
  "target_1_C" = c(NA,NA,NA,NA,NA,NA,5,3,1),
  "target_2_C" = c(NA,NA,NA,NA,NA,NA,1,9,2),
  "NEW_X_1_A"  = c(NA,NA,NA,9,4,7,NA,NA,NA),
  "NEW_X_2_A"  = c(NA,NA,NA,1,2,6,NA,NA,NA),
  "NEW_target_1_C" = c(NA,NA,NA,5,3,1,NA,NA,NA),
  "NEW_target_2_C" = c(NA,NA,NA,1,9,2,NA,NA,NA)
)
(I have a number of these "X_"s and exactly the same number of "target_" variables. I also do not just have this one set of A, B and C, but A1, A2, A3, C1, C2, C3 and even more Bs. For each set of A1, B1, C1 I also have a "set" of dates that does not match another's "set". But that would be less of a problem, as I could simply slice my dataset horizontally into sets, do the trick for each of them separately, and merge them again.)
But how would I bring A's and C's values into B's rows based on Group=="B" and based on date?
Using data.table you can try
df <- data.frame(
  "Date"  = c("1990-03","2000-01","2010-09","1990-03","2000-01","2010-09","1990-03","2000-01","2010-09"),
  "Group" = c("A","A","A","B","B","B","C","C","C"),
  "X1_A"  = c(9,4,7,NA,NA,NA,NA,NA,NA),
  "X2_A"  = c(1,2,6,NA,NA,NA,NA,NA,NA),
  "target_value_1_B" = c(NA,NA,NA,0,2,9,NA,NA,NA),
  "target_value_2_B" = c(NA,NA,NA,9,2,1,NA,NA,NA),
  "target_value_1_C" = c(NA,NA,NA,NA,NA,NA,5,3,1),
  "target_value_2_C" = c(NA,NA,NA,NA,NA,NA,1,9,2)
)
library(data.table)
setDT(df)[, `:=`(
  NEW_X1 = ifelse(Group == "B", X1_A[Group == "A"], NA),
  NEW_X2 = ifelse(Group == "B", X2_A[Group == "A"], NA),
  NEW_target_value_1_C = ifelse(Group == "B", target_value_1_C[Group == "C"], NA),
  NEW_target_value_2_C = ifelse(Group == "B", target_value_2_C[Group == "C"], NA)
)]
Which results in:
df
Date Group X1_A X2_A target_value_1_B target_value_2_B target_value_1_C target_value_2_C NEW_X1 NEW_X2 NEW_target_value_1_C NEW_target_value_2_C
1: 1990-03 A 9 1 NA NA NA NA NA NA NA NA
2: 2000-01 A 4 2 NA NA NA NA NA NA NA NA
3: 2010-09 A 7 6 NA NA NA NA NA NA NA NA
4: 1990-03 B NA NA 0 9 NA NA 9 1 5 1
5: 2000-01 B NA NA 2 2 NA NA 4 2 3 9
6: 2010-09 B NA NA 9 1 NA NA 7 6 1 2
7: 1990-03 C NA NA NA NA 5 1 NA NA NA NA
8: 2000-01 C NA NA NA NA 3 9 NA NA NA NA
9: 2010-09 C NA NA NA NA 1 2 NA NA NA NA
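A join-based sketch is another option (assuming df has just been re-created from the data.frame() call above and converted with setDT(df), without running the `:=` step). It does not rely on the A, B and C blocks being equally sized and ordered the same way, which the ifelse() recycling above implicitly assumes:
setDT(df)
a_vals <- df[Group == "A", .(Date, NEW_X1 = X1_A, NEW_X2 = X2_A)]
c_vals <- df[Group == "C", .(Date,
                             NEW_target_value_1_C = target_value_1_C,
                             NEW_target_value_2_C = target_value_2_C)]
df2 <- merge(df, a_vals, by = "Date", all.x = TRUE)   # attach A's values by Date
df2 <- merge(df2, c_vals, by = "Date", all.x = TRUE)  # attach C's values by Date
new_cols <- c("NEW_X1", "NEW_X2", "NEW_target_value_1_C", "NEW_target_value_2_C")
df2[Group != "B", (new_cols) := NA]                   # keep the new values only in B's rows
Note that merge() re-sorts the rows by Date, so re-order afterwards if the original row order matters.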

find the row with the highest number of NA values in R

I have a data frame
df
1 a c NA NA
2 a a a NA
3 c NA NA NA
Firstly, I want to find which row has the highest number of NA values. I am also interested in finding the rows that have more than 2 NA values.
How can I do it in R?
na_rows = rowSums(is.na(df)) gives the count of NAs per row. You can then look at which.max(na_rows) and which(na_rows > 2).
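A small worked example on the data frame above (the column names V1..V4 are made up for the sketch):
df <- data.frame(V1 = c("a", "a", "c"),
                 V2 = c("c", "a", NA),
                 V3 = c(NA, "a", NA),
                 V4 = c(NA, NA, NA))
na_rows <- rowSums(is.na(df))   # 2 1 3  -- NA count per row
which.max(na_rows)              # 3      -- the row with the most NAs
which(na_rows > 2)              # 3      -- rows with more than 2 NAs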

ordering a vector in R while ignoring yet keeping NAs

If I have a vector a = c(1300,NA,NA,NA,NA,1500,NA,NA,6000,NA,NA,900)
How can I order this vector to result in:
b=[2,NA,NA,NA,NA,3,NA,NA,4,NA,NA,5]?
Side note: I tried to make them repeat, so it was
a=[1300,1300,1300,1300,1300,1500,1500,1500,6000,6000,6000,900]
But when I use rank I get some crazy half numbers. Any ideas? I'm at my wits' end trying to figure this out.
Keeping the number of NAs after each value is very important here, so I can't simply ignore them.
The dplyr::dense_rank function behaves as you want:
library(dplyr)
dense_rank(a)
# [1] 2 NA NA NA NA 3 NA NA 4 NA NA 1
It also works on the dense vector:
b = c(1300,1300,1300,1300,1300,1500,1500,1500,6000,6000,6000,900)
dense_rank(b)
# [1] 2 2 2 2 2 3 3 3 4 4 4 1
A base R alternative with replace() and rank():
replace(a, !is.na(a), rank(a[!is.na(a)], ties.method = "first"))
# [1] 2 NA NA NA NA 3 NA NA 4 NA NA 1
Another option: take a ^ is.na(a), which is 1 where a is not NA and NA where it is, and multiply it by rank(a), so that only the non-NA positions keep their ranks. We use ties = "first" to ensure we get increasing values at each index, not averages.
rank(a, ties="first") * a ^ is.na(a)
# [1] 2 NA NA NA NA 3 NA NA 4 NA NA 1

Creating categorical variables from mutually exclusive dummy variables

My question regards an elaboration on a previously answered question about combining multiple dummy variables into a single categorical variable.
In the question previously asked, the categorical variable was created from dummy variables that were NOT mutually exclusive. For my case, my dummy variables are mutually exclusive because they represent crossed experimental conditions in a 2X2 between-subjects factorial design (that also has a within subjects component which I'm not addressing here), so I don't think interaction does what I need to do.
For example, my data might look like this:
id conditionA conditionB conditionC conditionD
1 NA 1 NA NA
2 1 NA NA NA
3 NA NA 1 NA
4 NA NA NA 1
5 NA 2 NA NA
6 2 NA NA NA
7 NA NA 2 NA
8 NA NA NA 2
I'd like to now make categorical variables that combine ACROSS different types of conditions. For example, people who had values for conditions A or B might be coded with one categorical variable, and people who had values for conditions C or D with another.
id conditionA conditionB conditionC conditionD factor1 factor2
1 NA 1 NA NA 1 NA
2 1 NA NA NA 1 NA
3 NA NA 1 NA NA 1
4 NA NA NA 1 NA 1
5 NA 2 NA NA 2 NA
6 2 NA NA NA 2 NA
7 NA NA 2 NA NA 2
8 NA NA NA 2 NA 2
Right now, I'm doing this using ifelse() statements, which quite simply is a hot mess (and doesn't always work). Please help! There's probably some super-obvious "easier way."
EDIT:
The kinds of ifelse commands that I am using are as follows:
attach(df)
df$factor<-ifelse(conditionA==1 | conditionB==1, 1, NA)
df$factor<-ifelse(conditionA==2 | conditionB==2, 2, df$factor)
In reality, I'm combining across 6-8 columns each time, so a more elegant solution would help a lot.
Update (2019): Please use dplyr::coalesce(), it works pretty much the same.
My R package has a convenience function that allows you to choose the first non-NA value for each element in a list of vectors:
#library(devtools)
#install_github('kimisc', 'muelleki')
library(kimisc)
df$factor1 <- with(df, coalesce.na(conditionA, conditionB))
(I'm not sure if this works if conditionA and conditionB are factors. Convert them to numerics first using as.numeric(as.character(...)) if necessary.)
Otherwise, you could give interaction a try, combined with recoding of the levels of the resulting factor -- but to me it looks like you're more interested in the first solution:
df$conditionAB <- with(df, interaction(coalesce.na(conditionA, 0),
                                       coalesce.na(conditionB, 0)))
levels(df$conditionAB) <- c('A', 'B')
I think this function gives you what you need (admittedly, this is a quick hack).
to_indicator <- function(x, grp) {
  # For each row, find the single non-NA entry; return its value if its
  # column name is in grp, otherwise NA.
  apply(x, 1, function(row) {
    idx <- which(!is.na(row))
    nm  <- names(idx)
    if (nm %in% grp) row[idx] else NA
  })
}
And here it is used with the example data you provided.
tbl <- read.table(header=TRUE, text="
conditionA conditionB conditionC conditionD
NA 1 NA NA
1 NA NA NA
NA NA 1 NA
NA NA NA 1
NA 2 NA NA
2 NA NA NA
NA NA 2 NA
NA NA NA 2")
tbl <- data.frame(tbl)
(tbl <- cbind(tbl,
factor1=to_indicator(tbl, c("conditionA", "conditionB")),
factor2=to_indicator(tbl, c("conditionC", "conditionD"))))
Well, I think you can do it simply with ifelse(), something like:
factor1 <- ifelse(is.na(conditionA), conditionB, conditionA)
Another way could be:
factor1 <- conditionA
factor1[is.na(factor1)] <- conditionB[is.na(factor1)]
And a third solution, certainly more practical if you have more than two condition columns:
factor1 <- apply(df[, c("conditionA","conditionB")], 1, sum, na.rm = TRUE)
(Note that with na.rm = TRUE, rows that are NA in both columns come out as 0 rather than NA.)
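As mentioned in the 2019 update above, dplyr::coalesce() now covers this directly. A sketch on the example data, recreating it with the column names from the question:
library(dplyr)

df <- data.frame(
  id = 1:8,
  conditionA = c(NA, 1, NA, NA, NA, 2, NA, NA),
  conditionB = c(1, NA, NA, NA, 2, NA, NA, NA),
  conditionC = c(NA, NA, 1, NA, NA, NA, 2, NA),
  conditionD = c(NA, NA, NA, 1, NA, NA, NA, 2)
)

df <- df %>%
  mutate(factor1 = coalesce(conditionA, conditionB),   # first non-NA of A/B
         factor2 = coalesce(conditionC, conditionD))   # first non-NA of C/D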
