Conditional statement in R dataframe

I have a data frame df, shown below.
dput(df)
structure(list(X = c(1, 2, 5, 7, 8), Y = c(3, 5, 8, 7, 2), Z = c(2,
8, 7, 4, 3), R = c(6, 6, 6, 6, 66)), .Names = c("X", "Y", "Z",
"R"), row.names = c(NA, -5L), class = "data.frame")
df
class(df)
I need to modify df in two ways.
First case:
For each row, find the minimum of X, Y and Z, and replace whichever value is the minimum with the corresponding value of R.
Second case:
For each row, find the minimum of X, Y, Z and R, replace it with the row maximum of X, Y, Z and R, and store the result in a new data frame.
How can I do that?
I tried ifelse and if/else but could not get what I wanted.
Any help would be appreciated.

First case: create a new dataset "df1" from the first three columns of "df". Multiply "df1" by -1 so that the minimum value in each row becomes the maximum (assuming that there are no negative values). In this example the minimum is unique in every row, so you can use the function max.col with ties.method='first'. It returns the column index of the maximum value (here, the original minimum) for each row. cbind that with 1:nrow(df) to build a row/column index matrix, use it to select the elements of "df1" (df1[cbind(...)]), and assign the "R" column values to them (<- df$R). Then write the new values back to the original columns (df[1:3]). If a row can have more than one minimum value, use the loop method described for the second case.
df1 <- df[1:3]
df1[cbind(1:nrow(df),max.col(-1*df1, 'first'))] <- df$R
df[1:3] <- df1
df
# X Y Z R
#1 6 3 2 6
#2 6 5 8 6
#3 6 8 7 6
#4 7 7 6 6
#5 8 66 3 66
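Second case: create a copy of "df" (df2), get the row-wise maximum with pmax, loop over the rows of "df2" (sapply(seq_len(...))), replace the minimum value(s) in each row with the corresponding maximum ("MaxV"), transpose the result (t), and assign it back to "df2" (df2[]). Note that df2 is copied from "df" after the first-case modification, so the output below reflects the modified values rather than the original data.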
df2 <- df
#only use this if there is only a single "minimum" value per row
# and no negative values in the data
#df2[cbind(1:nrow(df), max.col(-1*df2, 'first'))] <-
# do.call(pmax, df2)
MaxV <- do.call(pmax, df2)
df2[] <- t(sapply(seq_len(nrow(df2)), function(i) {
  x <- unlist(df2[i, ])
  ifelse(x == min(x), MaxV[i], x)
}))
df2
# X Y Z R
#1 6 3 6 6
#2 6 8 8 6
#3 8 8 7 8
#4 7 7 7 7
#5 8 66 66 66
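
As a cross-check, both cases can also be written without a loop by comparing every value with its row minimum. This is only a sketch and not part of the answer above: the names xyz, all_cols, df_first and df_second are made up here, and unlike the output above, the second case starts from the original df rather than from the modified one. Ties are covered because every value equal to the row minimum is replaced.

```r
# Original data from the question
df <- data.frame(X = c(1, 2, 5, 7, 8), Y = c(3, 5, 8, 7, 2),
                 Z = c(2, 8, 7, 4, 3), R = c(6, 6, 6, 6, 66))

# First case: replace the row minimum of X, Y, Z with R
xyz <- as.matrix(df[1:3])
is_min <- xyz == do.call(pmin, df[1:3])   # row minima recycle down each column
xyz[is_min] <- df$R[row(xyz)[is_min]]     # pick R from the matching row
df_first <- df
df_first[1:3] <- as.data.frame(xyz)

# Second case: replace the row minimum of X, Y, Z, R with the row maximum
all_cols <- as.matrix(df)
is_min <- all_cols == do.call(pmin, df)
all_cols[is_min] <- do.call(pmax, df)[row(all_cols)[is_min]]
df_second <- as.data.frame(all_cols)
```

Here df_first should match the first output shown above, while df_second differs from the second output because it is computed from the unmodified data.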

Related

Assign value to a specific rows (R)

I have a df of 16k+ items. I want to assign the values A, B and C to those items based on their row numbers.
Example:
I have the following df with 10 unique items
df <- c(1:10)
Now I have three separate vectors (A, B, C) that contain row numbers of the df with values A, B or C.
A <- c(3, 9)
B <- c(2, 6, 8)
C <- c(1, 4, 5, 7, 10)
Now I want to add a new category column to the df and assign values A, B and C based on the row numbers that are in the three vectors that I have. For example, I would like to assign value C to rows 1, 4, 5, 7 and 10 of the df.
I tried to experiment with for loops and if statements to match the value of the vector with the row number of the df but I didn't succeed. Can anybody help out?

Here is a way to assign the new column.
Create the data frame and a list of vectors:
df <- data.frame(n=1:10)
dat <- list( A=c(3, 9), B=c(2, 6, 8), C=c(1, 4, 5, 7, 10) )
Put the data in the desired rows:
df$new[unlist(dat)] <- sub("[0-9].*$","",names(unlist(dat)))
Result:
df
    n new
1   1   C
2   2   B
3   3   A
4   4   C
5   5   C
6   6   B
7   7   C
8   8   B
9   9   A
10 10   C
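
If the list names could themselves contain digits, stripping digits from names(unlist(dat)) with a regular expression would be fragile. A minimal alternative sketch, assuming the same df and dat as above, repeats each name once per row number instead (the variable name labels is made up here):

```r
df <- data.frame(n = 1:10)
dat <- list(A = c(3, 9), B = c(2, 6, 8), C = c(1, 4, 5, 7, 10))

# One label per row number, in the same order as unlist(dat)
labels <- rep(names(dat), lengths(dat))

df$new <- NA_character_          # rows not listed in dat would stay NA
df$new[unlist(dat)] <- labels
```

Creating the column first also makes it explicit what happens to rows that appear in none of the vectors.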

You could iterate over the names of a list and assign those names to the positions indexed by the successive sets of numeric values:
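dat <- list(A=A,B=B,C=C)
# create the column first; otherwise the first partial assignment
# builds a vector shorter than nrow(df) and the assignment fails
df$new <- NA_character_
for(i in names(dat)){ df$new[ dat[[i]] ] <- i}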

Identifying columns with high correlation in large dataset

I have two large data frames (50+ columns, many of them long character variables) and I need to identify the "link" variable that I should use to merge them. The problem is that the variable names don't match. That is, I need to identify variables in the two datasets whose values have a high correlation.
As an example :
dta1 = data.frame(A = c(1 , 2,3, 4), B = c( 23, 45, 6, 8), C = c("001", "028", "076", "039"))
dta2 = data.frame(first = c(5, 6, 7, 8), second = c( 58, 32, 33, 45), third = c("008", "028", "076", "039"))
I would like the code to tell me that columns C and third have a very high correlation (they are not complete duplicates though!).
I have tried adding the two dataframes and running a cor() function, but this doesn't work with character variables.
Also tried union_all(x, y, ...) from dplyr but that requires the same column names.
At this point I am out of ideas.
Thanks very much.

To identify the most similar columns, try the following. It compares each column of dta1 with each column of dta2, position by position, and returns a matrix that counts the rows in which the two values are equal.
sapply(dta1, function(x) sapply(dta2, function(y) sum(x == y)))
       A B C
first  0 1 0
second 0 0 0
third  0 0 3
From here we can see that third and C have the most matches. Now you can join your two data.frames. To keep all rows and columns, you will want a full_join from the dplyr package.
library(dplyr)
full_join(dta1, dta2, by = c("C" = "third"))
   A  B   C first second
1  1 23 001    NA     NA
2  2 45 028     6     32
3  3  6 076     7     33
4  4  8 039     8     45
5 NA NA 008     5     58
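
The x == y comparison above only counts matches that sit in the same row position, so it would miss a link column if the two data frames were sorted differently. As a hedged variation, not taken from the answer, the sketch below measures which share of the distinct values in each dta1 column occurs anywhere in each dta2 column. The names overlap, best and best_pair are made up for illustration.

```r
dta1 <- data.frame(A = c(1, 2, 3, 4), B = c(23, 45, 6, 8),
                   C = c("001", "028", "076", "039"), stringsAsFactors = FALSE)
dta2 <- data.frame(first = c(5, 6, 7, 8), second = c(58, 32, 33, 45),
                   third = c("008", "028", "076", "039"), stringsAsFactors = FALSE)

# Share of distinct values in each dta1 column that also occur in each dta2 column
overlap <- sapply(dta1, function(x) {
  sapply(dta2, function(y) mean(unique(as.character(x)) %in% as.character(y)))
})

# Position of the best-overlapping pair: row = dta2 column, col = dta1 column
best <- which(overlap == max(overlap), arr.ind = TRUE)
best_pair <- c(dta1 = colnames(overlap)[best[1, "col"]],
               dta2 = rownames(overlap)[best[1, "row"]])
```

On the example data the highest share belongs to C and third. With 50+ columns on each side the overlap matrix is still cheap to compute, but numeric columns with small value ranges can overlap by chance, so the result is a hint rather than proof.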

how to combine multiple columns with grep and sum the values in r

I have the following data frame in R:
Engine General Ladder.winch engine.phe subm.gear.box aux.engine pipeline.maintain pipeline pipe.line engine.mpd
1 12 22 2 4 2 4 5 6 7
and so on with more than 10000 rows.
Now, I want to combine columns and sum their values to reduce the columns to broader categories. For example, Engine, engine.phe, aux.engine and engine.mpd should be combined into an Engine category with all their values added together. Likewise, pipeline.maintain, pipeline and pipe.line should be combined into Pipeline, and the remaining columns should be added up under a General category.
Desired dataframe would be
Engine Pipeline General
12 15 38
How can I do it in r?

There are many ways to do this; here is a straightforward approach:
# Example data.frame
dtf <- structure(list(Engine = c(1, 0, 1),
General = c(12, 3, 15), Ladder.winch = c(22, 28, 26),
engine.phe = c(2, 1, 0), subm.gear.box = c(4, 4, 10),
aux.engine = c(2, 3, 1), pipeline.maintain = c(4, 5, 1),
pipeline = c(5, 5, 2), pipe.line = c(6, 8, 2), engine.mpd = c(7, 8, 19)),
.Names = c("Engine", "General", "Ladder.winch", "engine.phe",
"subm.gear.box", "aux.engine", "pipeline.maintain",
"pipeline", "pipe.line", "engine.mpd"),
row.names = c(NA, -3L), class = "data.frame")
with(dtf, data.frame(Engine=Engine+engine.phe+aux.engine+engine.mpd,
Pipeline=pipeline.maintain+pipeline+pipe.line,
General=General+Ladder.winch+subm.gear.box))
# Engine Pipeline General
# 1 12 15 38
# 2 12 18 35
# 3 21 5 51
# a more generalized and 'greppy' solution
cnames <- tolower(colnames(dtf))
data.frame(Engine=rowSums(dtf[, grep("eng", cnames)]),
Pipeline=rowSums(dtf[, grep("pip", cnames)]),
General=rowSums(dtf[, !grepl("eng|pip", cnames)]))

Here is an option that extracts the relevant words from the column names and uses tapply to get the sums. str_extract_all returns a list ('lst'). Replace the elements of zero length with 'GENERAL'. Then unlist the dataset and call tapply with two grouping variables, the category of each column ('lst' indexed by col(df1)) and the row index (row(df1)), to get the sum per row and category. Here 'df1' stands for the data frame from the question.
library(stringr)
lst <- str_extract_all(toupper(sub("(pipe)\\.", "\\1", names(df1))),
"ENGINE|PIPELINE|GENERAL")
lst[lengths(lst)==0] <- "GENERAL"
t(tapply(unlist(df1), list(unlist(lst)[col(df1)], row(df1)), FUN = sum))
# ENGINE GENERAL PIPELINE
#1 12 38 15

It is usually better to store your data in long format. I would therefore approach the problem as follows:
1 - get your data in long format
library(reshape2)
dfl <- melt(df)
2 - create 'engine' and 'pipeline'-vectors
e_vec <- c("Engine","engine.phe","aux.engine","engine.mpd")
p_vec <- c("pipeline.maintain","pipeline","pipe.line")
3 - create a category column
dfl$newcat <- c("general","engine","pipeline")[1 + dfl$variable %in% e_vec + 2*(dfl$variable %in% p_vec)]
The result:
> dfl
variable value newcat
1 Engine 1 engine
2 General 12 general
3 Ladder.winch 22 general
4 engine.phe 2 engine
5 subm.gear.box 4 general
6 aux.engine 2 engine
7 pipeline.maintain 4 pipeline
8 pipeline 5 pipeline
9 pipe.line 6 pipeline
10 engine.mpd 7 engine
Now you can use aggregate to get the final result:
> aggregate(value ~ newcat, dfl, sum)
newcat value
1 engine 12
2 general 38
3 pipeline 15

myfactors = ifelse(grepl("engine", names(df), ignore.case = TRUE), "Engine",
                   ifelse(grepl("pipe|pipeline", names(df), ignore.case = TRUE), "Pipeline",
                          "General"))
data.frame(lapply(split.default(df, myfactors), rowSums))
# Engine General Pipeline
#1 12 38 15
#2 12 35 18
#3 21 51 5
Here, df is the three-row example data from the first answer above (called dtf there).
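
Pattern matching on column names can misfire when a name happens to contain "eng" or "pip" for unrelated reasons. A hedged alternative, not taken from any of the answers, is to spell out the mapping and let base R's rowsum() do the grouping. The lookup vector category and the names groups and result are hypothetical.

```r
dtf <- data.frame(Engine = c(1, 0, 1), General = c(12, 3, 15),
                  Ladder.winch = c(22, 28, 26), engine.phe = c(2, 1, 0),
                  subm.gear.box = c(4, 4, 10), aux.engine = c(2, 3, 1),
                  pipeline.maintain = c(4, 5, 1), pipeline = c(5, 5, 2),
                  pipe.line = c(6, 8, 2), engine.mpd = c(7, 8, 19))

# Explicit column-to-category lookup (an assumption of this sketch)
category <- c(Engine = "Engine", engine.phe = "Engine", aux.engine = "Engine",
              engine.mpd = "Engine", pipeline.maintain = "Pipeline",
              pipeline = "Pipeline", pipe.line = "Pipeline")
groups <- unname(category[names(dtf)])
groups[is.na(groups)] <- "General"   # every column not listed falls under General

# rowsum() sums matrix rows by group, so transpose, sum, and transpose back
result <- as.data.frame(t(rowsum(t(as.matrix(dtf)), groups)))
```

The columns of result come out in alphabetical order of the category names (Engine, General, Pipeline), one row per row of dtf.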

Matching dataframe columns: one int and another is list

Trying to create a column in dataframe df1 based on match in another dataframe df2, where df1 is much bigger than df2:
df1$val2 <- df2$val2[match(df1$id, df2$IDs)]
This doesn't quite work because df2$IDs column is a list:
> df2
IDs val2
1 0 1
2 1, 2 2
3 3, 4 3
4 5, 6 4
5 7, 8 5
6 9, 10 6
7 11, 12, 13, 14 7
It only works for the part where the list has 1 element (row 1: ..$ : int 0 above). For all other rows the 'match(df1$id, df2$IDs)' returns NA.
Test of matching some individual numbers works just fine with double brackets:
2 %in% df2[[2,'IDs']]
So, I either need to modify the column df2$IDs or need to perform match operation differently. The df1 has many other columns, so does the df2, but df2 is much shorter in rows.
The case can be reproduced with the following:
IDs <- c("[0]", "[1, 2]", "[3, 4]", "[5, 6]", "[7, 8]", "[9, 10]", "[11, 12, 13, 14]")
val2 <- c(1,2,3,4,5,6,7)
df2 <- data.frame(IDs, val2)
df2$IDs <- lapply(strsplit(as.character(df2$IDs), ','), function (x) as.integer(gsub("\\s|\\[|\\]", "", x)))
id <- floor(runif(100, min=0, max=15))
df1 <- data.frame(id)
str(df1)
str(df2)
df1$val2 <- df2$val2[match(df1$id, df2$IDs)]

List columns are clumsy to work with. If you convert df2 to a more vanilla format, it works:
DF2 = with(df2, data.frame(ID = unlist(IDs), val2 = rep(val2, lengths(IDs))))
df1$m = DF2$val2[ match(df1$id, DF2$ID) ]
If you want list columns just for browsing, it is quick to do...
aggregate(ID ~ ., DF2, list)
val2 ID
1 1 0
2 2 1, 2
3 3 3, 4
4 4 5, 6
5 5 7, 8
6 6 9, 10
7 7 11, 12, 13, 14
FYI, the match approach does not extend naturally to joins on several columns, so you may eventually want to learn data.table and its "update join" syntax for this case:
library(data.table)
setDT(df1); setDT(df2)
DT2 = df2[, .(ID = unlist(IDs)), by=setdiff(names(df2), "IDs")]
df1[DT2, on=.(id = ID), v := i.val2 ]
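
To see the flattening step on data that does not depend on random numbers, here is a small self-contained sketch in base R. The id values, including the deliberately unmatched 99, and the name lookup are assumptions made for illustration.

```r
# Small deterministic version of the question's data
df2 <- data.frame(val2 = c(1, 2, 3, 4, 5, 6, 7))
df2$IDs <- list(0L, 1:2, 3:4, 5:6, 7:8, 9:10, 11:14)
df1 <- data.frame(id = c(0L, 2L, 13L, 5L, 99L))   # 99 is deliberately unmatched

# Flatten the list column: one row per (ID, val2) pair
lookup <- data.frame(ID = unlist(df2$IDs),
                     val2 = rep(df2$val2, lengths(df2$IDs)))

# An ordinary match() now finds every ID, not only the single-element rows
df1$val2 <- lookup$val2[match(df1$id, lookup$ID)]
```

Ids that appear in no list element, such as 99 here, end up as NA, which is easy to check with anyNA(df1$val2).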

Select row numbers of a data frame conditioning on another data frame

I have a data frame, and I want to find the numbers of its rows that also appear in another data frame.
To make the question clear, say I have data frame A and data frame B:
dfA <- data.frame(NAME = rep(c("a", "b"), each = 3),
TRIAL = rep(1:3, 2),
DATA = runif(6))
dfB <- data.frame(NAME = c("a", "b"),
TRIAL = c(2, 3))
dfA
# NAME TRIAL DATA
# 1 a 1 0.62948592
# 2 a 2 0.88041819
# 3 a 3 0.02479411
# 4 b 1 0.48031827
# 5 b 2 0.86591315
# 6 b 3 0.93448264
dfB
# NAME TRIAL
# 1 a 2
# 2 b 3
I want to get dfA's row number where dfA and dfB have the same NAME and TRIAL, in this case, row numbers are 2 and 6.
I tried the following code, but it gives me rows 2, 3, 5 and 6. It matches NAME and TRIAL separately, so it doesn't work.
which(dfA$NAME %in% dfB$NAME & dfA$TRIAL %in% dfB$TRIAL)
# 2 3 5 6
Then I tried creating a dummy column and matching on it. This works, but the code would be verbose if dfB had many columns...
dfA$dummy <- paste0(dfA$NAME, dfA$TRIAL)
dfB$dummy <- paste0(dfB$NAME, dfB$TRIAL)
which(dfA$dummy %in% dfB$dummy)
# 2 6
I'm wondering if there are better ways to solve the problem, thanks for your help!

You can do:
merge(transform(dfA, row.num = 1:nrow(dfA)), dfB)$row.num
# [1] 2 6
And if the whole goal of finding the indices is so that you can subset dfA, then you can just do merge(dfA, dfB).

Or use duplicated:
apply(dfB, 1, function(x)
which(duplicated(rbind(x, dfA[1:2])))-1)
# [1] 2 6
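
The dummy-column idea from the question can also be generalized so that it does not grow with the number of columns. The sketch below is an assumption-laden variation rather than one of the answers: make_key is a hypothetical helper, and it assumes that no key value contains the "|" separator.

```r
dfA <- data.frame(NAME = rep(c("a", "b"), each = 3),
                  TRIAL = rep(1:3, 2),
                  DATA = runif(6))
dfB <- data.frame(NAME = c("a", "b"),
                  TRIAL = c(2, 3))

# Use every column the two data frames have in common as part of the key
key_cols <- intersect(names(dfA), names(dfB))

# Hypothetical helper: paste the key columns of each row into one string
make_key <- function(d) {
  do.call(paste, c(unname(as.list(d[key_cols])), sep = "|"))
}

rows <- which(make_key(dfA) %in% make_key(dfB))
```

Unlike the merge approach, this keeps the row numbers in dfA's original order and adds no columns to either data frame.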
