R: Manipulate data so that columns with the same values combine in particular ways

I have a dataframe where each column is made up of zero along with one other number. For example:
I want to collapse the columns that share the same nonzero number into a single column. In that column, a row keeps the number only if every column in the group has it in that row; otherwise the row becomes zero.
So for instance, I would want the dataframe above to look like
..1 ..2 ..3
1 2 3
0 2 0
0 0 0
1 0 0
The first row of the first column is 1 because both original columns had a 1 in that row. The second row of the first column is 0 because that row had a 1 and a 0.
Here is some reproducible data:
structure(list(...1 = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), ...2 = c(1, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0), ...3 = c(2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), ...4 = c(3,
0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0,
0, 0, 0, 0, 0, 0), ...5 = c(3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), ...6 = c(3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
-28L), class = "data.frame")

Here is a possible solution in base R, where dat is the data frame from the question. First find the nonzero value of each column, assuming each column has a single nonzero value and that it is positive, so max() finds it. Then loop over the distinct values and, for each group of columns sharing a value, apply all() row-wise to flag the rows where every column in the group is nonzero. Multiplying that logical vector by the value gives the combined column. Collect the columns in a list and bind them into a data frame.
col_vals <- apply(dat, 2, max)  # nonzero value of each column
columns <- list()
for (val in unique(col_vals)) {
  # compare with 0 first so all() receives logicals instead of coercing numbers
  grp <- dat[, col_vals == val, drop = FALSE] != 0
  columns[[length(columns) + 1]] <- val * apply(grp, 1, all)
}
as.data.frame(do.call(cbind, columns))
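As a quick check of the approach described above, here is a minimal sketch on a toy data frame built from the first four rows of the question's data. The names `toy` and `combine_cols` are hypothetical, and the sketch assumes every column holds zeros plus one positive value. It swaps the row-wise all() for a rowSums() count, which should give the same result.

```r
# Toy data: first four rows of the question's data (column names are made up)
toy <- data.frame(
  a = c(1, 1, 1, 1),
  b = c(1, 0, 0, 1),
  c = c(2, 2, 0, 0),
  d = c(3, 0, 0, 3),
  e = c(3, 3, 0, 0),
  f = c(3, 3, 0, 0)
)

# Hypothetical helper: one output column per distinct nonzero value
combine_cols <- function(dat) {
  col_vals <- vapply(dat, max, numeric(1))  # nonzero value of each column
  vals <- unique(col_vals)
  out <- lapply(vals, function(val) {
    grp <- dat[, col_vals == val, drop = FALSE]
    # keep the value only where every column in the group is nonzero
    val * (rowSums(grp != 0) == ncol(grp))
  })
  names(out) <- paste0("v", vals)
  as.data.frame(out)
}

res <- combine_cols(toy)
print(res)
```

The first two output columns reproduce the first two columns of the desired output shown in the question.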

Related

R for loop wise : Rowwise sum on conditions : Performance issue

I have a data set on which I am running code to change the value of a cell based on the sum of the preceding cells and the sum of the succeeding cells in the same row.
for (i in 1:row1)
{
  for(j in 3:col-1)
  { # for-loop over columns
    if (as.numeric(rowSums(e[i,2:j])) == 0 )
    {
      e1[i,j] <- 0
    }
    else if (as.numeric(rowSums(e[i,2:j])) > 0 && e[i,j] == 0 && as.numeric(rowSums(e[i,j:col])) > 0 )
    {
      e1[i,j] <- 1
    }
    else if (as.numeric(rowSums(e[i,2:j])) > 0 && e[i,j] == 1 && as.numeric(rowSums(e[i,j:col])) > 0 )
    {
      e1[i,j] <- 0
    }
  }
}
The runtime is very high. I would appreciate any suggestions to improve the speed. Additional info: the new values are being copied into the data frame.
Thanks,
Sandy
edit 2:
Sample data:
structure(list(`Sr no` = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19), `2018-01` = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2018-02` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2018-03` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2018-04` = c(0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2018-05` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0), `2018-06` = c(0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0), `2018-07` = c(0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0), `2018-08` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1), `2018-09` = c(0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0), `2018-10` = c(1,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1), `2018-11` = c(0,
1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1), `2018-12` = c(1,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2019-01` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2019-02` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
-19L), class = c("tbl_df", "tbl", "data.frame"))
I think you can do this with matrix logic. Depends if you have enough RAM.
# creating fake data
# nc <- 300 # number of columns
nc <- 10 # for testing
nn <- 1e6 # rows
e <- sapply(1:nc, function(x) sample.int(2, nn, replace = TRUE) - 1L)
e <- as.data.frame(e)
row1 <- nrow(e)
colc <- ncol(e)
# note that `:` binds tighter than `-`, so this is (3:colc) - 1:
3:colc-1
# which is not the same as:
3:(colc-1)
s <- 3:(colc-1) # I assume you meant this
e1 <- matrix(nrow = row1, ncol = length(s)) # empty resulting matrix
s1 <- sapply(s, function(j) rowSums(e[, 2:j])) # sum for each relevant i,j
s2 <- sapply(s, function(j) rowSums(e[, j:colc])) # sum for each relevant i,j
e2 <- as.matrix(e[, s]) # taking relevant columns of e
e1[s1 == 0] <- 0
e1[s1 > 0 & e2 == 0 & s2 > 0] <- 1
e1[s1 > 0 & e2 == 1 & s2 > 0] <- 0
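To gain some confidence that the matrix version matches the loop, the two can be compared on a small random data set. This is only a sketch: the 8 by 6 size, the seed, and the names `loop_res` and `vec_res` are assumptions for illustration, and the loop uses the `3:(colc - 1)` range assumed in the answer. Cells that none of the three conditions cover stay NA in both versions.

```r
set.seed(1)
# Small random 0/1 data frame (size chosen only for illustration)
e <- as.data.frame(matrix(sample(0:1, 8 * 6, replace = TRUE), nrow = 8))
colc <- ncol(e)
s <- 3:(colc - 1)

# Loop version, storing results by position within s
loop_res <- matrix(NA_real_, nrow = nrow(e), ncol = length(s))
for (i in seq_len(nrow(e))) {
  for (k in seq_along(s)) {
    j <- s[k]
    before <- sum(e[i, 2:j])     # sum of columns 2..j in row i
    after <- sum(e[i, j:colc])   # sum of columns j..last in row i
    if (before == 0) {
      loop_res[i, k] <- 0
    } else if (before > 0 && e[i, j] == 0 && after > 0) {
      loop_res[i, k] <- 1
    } else if (before > 0 && e[i, j] == 1 && after > 0) {
      loop_res[i, k] <- 0
    }
  }
}

# Matrix version, following the answer's logic
s1 <- sapply(s, function(j) rowSums(e[, 2:j]))
s2 <- sapply(s, function(j) rowSums(e[, j:colc]))
e2 <- as.matrix(e[, s])
vec_res <- matrix(NA_real_, nrow = nrow(e), ncol = length(s))
vec_res[s1 == 0] <- 0
vec_res[s1 > 0 & e2 == 0 & s2 > 0] <- 1
vec_res[s1 > 0 & e2 == 1 & s2 > 0] <- 0

same <- identical(loop_res, vec_res)
print(same)
```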

Translating SAS language to R language: Creating a new variable

I have some SAS code that I want to translate into R. I am interested in creating variables based on the conditions of other variables.
data wp;
set wp;
if totalcriteria =>3 and nonecom=0 then content=1;
if totalcriteria =>3 and nonecom=1 then content=0;
if totalcriteria <3 and nonecom=0 then content=0;
if totalcriteria <3 and nonecom=1 then content=0;
run;
This is the code I have in R. My conditions for "content" have changed, and I would like to translate the SAS code to R, ideally to replace the "mutate" line of the code below or to fit in with it:
wpnew <- wp %>%
mutate(content = ifelse (as.numeric(totalcriteria >= 3),1,0))%>%
group_by(district) %>%
summarise(totalreports =n(),
totalcontent = sum(content),
per.content=totalcontent/totalreports*100)
Can you help me translate this SAS code to R? Thank you in advance.
Here is the dput output
structure(list(Finances = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), Exercise = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), Relationships = c(0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0), Laugh = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 1), Gratitude = c(0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1), Regrets = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0), Meditate = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), Clutter = c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 0), Headache = c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 0), Loss = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0), Anger = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0), Difficulty = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), nonecom = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1,
1, 0, 1, 1, 0), Othercon = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), totalcriteria = c(0, 0, 2, 3, 2, 0, 0, 4, 3,
0, 0, 0, 3, 0, 0, 2)), class = "data.frame", row.names = c(NA,
-16L))
This is what I would like it to look like
V1 V2 V3...V12 nonecom Othercon totalcriteria content
1 1 1 0 1 0 3 0
0 0 1 0 0 0 8 1
1 0 0 0 0 1 2 0
1 0 1 0 1 0 1 0
I use case_when because its syntax is closer to the SAS IF statements. Your current approach tests only the first part of the IF condition (totalcriteria), not the second part regarding nonecom.
wpnew <- wp %>%
  mutate(content = case_when(totalcriteria >= 3 & nonecom == 0 ~ 1,
                             TRUE ~ 0))
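The same rule can also be sketched in base R, without dplyr. The data frame `wp_toy` and its district values are made up for illustration; only the column names totalcriteria, nonecom, district and content come from the question. The per-district summary mirrors the summarise() step in the question's code.

```r
# Hypothetical toy data with the columns the rule needs
wp_toy <- data.frame(
  district = c("north", "north", "south", "south"),
  totalcriteria = c(3, 4, 2, 3),
  nonecom = c(0, 1, 0, 0)
)

# content is 1 only when both SAS conditions hold, otherwise 0
wp_toy$content <- as.numeric(wp_toy$totalcriteria >= 3 & wp_toy$nonecom == 0)

# Per-district counts and percentage
totalreports <- tapply(wp_toy$content, wp_toy$district, length)
totalcontent <- tapply(wp_toy$content, wp_toy$district, sum)
summary_df <- data.frame(
  district = names(totalreports),
  totalreports = as.vector(totalreports),
  totalcontent = as.vector(totalcontent),
  per.content = as.vector(totalcontent / totalreports * 100)
)
print(summary_df)
```

Because all four SAS branches assign either 1 or 0 and only one branch yields 1, a single logical expression is enough here.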

Replace character in a df for numeric vector in R

I would like to replace characters with specific numeric vectors.
I have this df:
First Second Third
A C D
F R K
and I also have vectors like these
A = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
R = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
N = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
I have tried several times but I can't do it. Does anyone have some advice or idea?
One option is to unlist the data frame (converting to character if the columns are factors) and then use mget to look up the object named by each value, which returns the vectors in a list:
lst1 <- mget(as.character(unlist(df)))
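An alternative sketch keeps the vectors in a named list instead of as separate objects in the workspace, so the lookup does not depend on which environment mget searches. The toy data frame `df_toy`, the list `vectors`, and the shortened length-3 vectors are assumptions for illustration; every letter in the toy data has a matching vector.

```r
# Toy data frame of letters (stringsAsFactors = FALSE keeps them as character)
df_toy <- data.frame(
  First = c("A", "R"),
  Second = c("N", "A"),
  stringsAsFactors = FALSE
)

# Named list of numeric vectors, shortened to length 3 for readability
vectors <- list(
  A = c(1, 0, 0),
  R = c(0, 1, 0),
  N = c(0, 0, 1)
)

# unlist() walks the data frame column by column
keys <- as.character(unlist(df_toy))

# Look up one vector per cell, then stack them as rows of a matrix
lst1 <- vectors[keys]
encoded <- do.call(rbind, lst1)
print(encoded)
```

A letter with no entry in `vectors` would come back as NULL in `lst1`, so it may be worth checking `keys %in% names(vectors)` first.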

How to use ids from one dataframe to sum rows in another dataframe

I feel like this question has been asked before, but I can't seem to find an answer to it. Maybe my title is too vague, so feel free to change it.
So I have one data frame, a, with ids that correspond to column names in data frame b. Both data frames are simplified versions of much larger data frames.
here is data frame a
a <- structure(list(V1 = structure(c(4L, 5L, 1L, 2L, 3L), .Label = c("GEN[D00105].GT",
"GEN[D00151].GT", "GEN[D00188].GT", "GEN[D86396].GT", "GEN[D86397].GT"
), class = "factor")), row.names = c(NA, -5L), class = "data.frame")
here is data frame b
b <- structure(list(`GEN[D01104].GT` = c(0, 0, 0, 0, 1, 0, 0, 2, 0,
1, 1, 1, 1, 0, 0, 0, 2, 0, 0, 0), `GEN[D01312].GT` = c(1, 0,
2, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 2, 0, 0, 0), `GEN[D01878].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 2, 0, 0), `GEN[D01882].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0), `GEN[D01952].GT` = c(0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0), `GEN[D01953].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 2, 0), `GEN[D02053].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0), `GEN[D00316].GT` = c(0,
0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2, 0, 0), `GEN[D01827].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0), `GEN[D01881].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0), `GEN[D02044].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0), `GEN[D02085].GT` = c(0,
0, 0, 2, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0), `GEN[D02204].GT` = c(0,
0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0), `GEN[D02276].GT` = c(0,
0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0), `GEN[D02297].GT` = c(0,
0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0), `GEN[D02335].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 2, 0, 0), `GEN[D02397].GT` = c(0,
0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0), `GEN[D00856].GT` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0), `GEN[D00426].GT` = c(0,
0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0), `GEN[D02139].GT` = c(0,
0, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0), `GEN[D02168].GT` = c(0,
0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0)), row.names = c(NA,
-20L), class = "data.frame")
I want to be able to use the ids from data frame a to sum, row by row, the columns in data frame b that have a matching id, if that makes sense.
So in the past, I just did something like
b$affected.samples <- (b$`GEN[D86396].GT` + b$`GEN[D86397].GT` + b$`GEN[D00105].GT` + b$`GEN[D00151].GT` + b$`GEN[D00188].GT`)
which got annoying and took too much time, so I moved over to
b$affected.samples <- rowSums(b[,c(1:5)])
This isn't too bad for this example, but with my large data set my samples can be all over the place, and it's starting to take too much time to find where everything is. I was hoping there is a way to use my data frame a to sum the correct columns in data frame b.
Hopefully, I gave all the information you need! Let me know if you have any questions.
Thanks in advance!!
Extract the 'V1' column as a character vector, use it to select the columns of 'b' (assuming these column names are found in 'b'), and take the rowSums:
rowSums( b[as.character(a$V1)], na.rm = TRUE)
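If some ids might be absent from 'b', selecting by name fails with an "undefined columns selected" error. A minimal sketch of one way to guard against that, using made-up toy frames `a_toy` and `b_toy` with shortened ids (none of these names or values come from the question):

```r
# Toy id table; the third id has no matching column in b_toy
a_toy <- data.frame(V1 = factor(c("GEN[D1].GT", "GEN[D3].GT", "GEN[D9].GT")))

# check.names = FALSE keeps the bracketed column names unchanged
b_toy <- data.frame(
  `GEN[D1].GT` = c(0, 1, 2),
  `GEN[D2].GT` = c(1, 1, 1),
  `GEN[D3].GT` = c(2, 0, 1),
  check.names = FALSE
)

ids <- as.character(a_toy$V1)
present <- intersect(ids, names(b_toy))      # ids that exist as columns
missing_ids <- setdiff(ids, names(b_toy))    # ids to report or investigate

# Row-wise sum over the matching columns only
b_toy$affected.samples <- rowSums(b_toy[present], na.rm = TRUE)
print(b_toy)
```

Whether to skip missing ids silently or stop with an error depends on the data; `missing_ids` makes either choice easy.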

how do I rebuild data frame based on columns identified in a numeric vector?

I'm using R to complete some GA driven searches.
Returned from my GA script is the resulting chromosome, returned as a binary numeric of length 40.
An example is: c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0).
I also have a corresponding data frame with 40 columns.
Using the data in the numeric vector, how do I efficiently build a (or re-build the) data frame so that it contains only those columns represented by the 1's in my numeric vector?
Building a sample data.frame and assigning your sample vector to x:
df <- as.data.frame(matrix(sample(1:100, 400, replace=TRUE), ncol=40))
x <- c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0)
I can subset:
df[ ,x==1]
or:
df[, as.logical(x)]
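One detail worth sketching: when the vector selects exactly one column, `df[, mask == 1]` returns a plain vector rather than a data frame unless `drop = FALSE` is given. The small frame `df_small` and the masks below are made up for illustration.

```r
# Small stand-in for the 40-column data frame
df_small <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
mask <- c(0, 1, 0, 1)

# The mask must have one entry per column
stopifnot(length(mask) == ncol(df_small))

# Keep the columns flagged with 1; drop = FALSE always returns a data frame
kept <- df_small[, mask == 1, drop = FALSE]
print(names(kept))

# With a single selected column the result is still a data frame
single <- df_small[, c(0, 1, 0, 0) == 1, drop = FALSE]
print(class(single))
```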
