New conditioned column in R data - r

I'm taking a data mining course and need to manipulate some data to do the desired task using randomForest. V1, V2, and V3 are the column names. If V1=A and V2=2, I want R to output "Eureka" in the corresponding row of a new column V4. I want the other values in V4 to be set to "NOPE". The actual data set has 300000 rows and 6 columns. This may seem strange, but if I can learn how to do this my problem will be solved. Thanks.
V1 V2 V3
A 1 4
A 1 8
A 2 4
A 2 8
C 1 10
C 1 9
C 2 10
C 2 9
V1 V2 V3 V4
A 1 4 NOPE
A 1 8 NOPE
A 2 4 Eureka
A 2 8 Eureka
C 1 10 NOPE
C 1 9 NOPE
C 2 10 NOPE
C 2 9 NOPE
The following code does NOT work.
`for(g in 1:8){
if(data$V1[g]=="A"&data$V2[g]==2){
data$V4[g]=Eureka
}else{
data$V4[g]="NOPE"
}
}`

We could use either a numeric index or ifelse to create the "V4" column. V1=='A' & V2==2 gives a logical index (TRUE/FALSE). Adding 1 coerces the logical vector to binary (0/1 for FALSE/TRUE) and so gives 1/2 for FALSE/TRUE. These numeric values can then be used as an index into c('NOPE', 'Eureka'), picking 'NOPE' for 1 and 'Eureka' for 2.
df$V4 <- with(df, c('NOPE', 'Eureka')[(V1=='A' & V2==2)+1])
df
# V1 V2 V3 V4
#1 A 1 4 NOPE
#2 A 1 8 NOPE
#3 A 2 4 Eureka
#4 A 2 8 Eureka
#5 C 1 10 NOPE
#6 C 1 9 NOPE
#7 C 2 10 NOPE
#8 C 2 9 NOPE
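To see why adding 1 works, the intermediate values can be inspected one step at a time. This is only a small sketch on a stand-in data frame with the same V1/V2 values as the question; the names cond, idx and lab are made up for illustration.

```r
# Stand-in data with the same V1/V2 values as the example (assumption: V3 is not needed here)
df <- data.frame(V1 = c("A", "A", "A", "A", "C", "C", "C", "C"),
                 V2 = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
                 stringsAsFactors = FALSE)

cond <- df$V1 == "A" & df$V2 == 2   # logical vector, TRUE where both conditions hold
idx  <- cond + 1                    # FALSE becomes 1, TRUE becomes 2
lab  <- c("NOPE", "Eureka")         # position 1 is "NOPE", position 2 is "Eureka"
V4   <- lab[idx]                    # pick a label for every row

print(cond)
print(idx)
print(V4)
```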
Or using ifelse
df$V4 <- with(df, ifelse(V1=='A' & V2==2, 'Eureka', 'NOPE'))
Another option would be data.table. Convert the "data.frame" to "data.table" (setDT). Create the column (V4) with the value NOPE. The rows of V4 that meet the condition (V1=='A' & V2==2) are then assigned Eureka.
library(data.table)
setDT(df)[,V4:='NOPE'][V1=='A' & V2==2, V4:='Eureka'][]
Regarding the error in your code, 'Eureka' should be quoted. It is better to use vectorized methods rather than loops.
for(g in 1:8){
if(df$V1[g]=='A' & df$V2[g]==2){
df$V4[g] <- 'Eureka'
}
else{
df$V4[g] <- 'NOPE'
}
}
df$V4
#[1] "NOPE" "NOPE" "Eureka" "Eureka" "NOPE" "NOPE" "NOPE" "NOPE"
data
df <- structure(list(V1 = c("A", "A", "A", "A", "C", "C", "C", "C"),
V2 = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), V3 = c(4L, 8L, 4L,
8L, 10L, 9L, 10L, 9L)), .Names = c("V1", "V2", "V3"), class =
"data.frame", row.names = c(NA, -8L))

Set the vector to the default value, then replace the matching instances with the new value using logical indexing.
data$V4 <- "NOPE"
data$V4[data$V1 == "A" & data$V2 == 2] <- "Eureka"

Related

Summation of the corresponding number of values which are in different columns

My data frame looks like below:
df<-data.frame(alphabets1=c("A","B","C","B","C"," ","NA"),alphabets2=c("B","A","D","D"," ","E","NA"),alphabets3=c("C","F","G"," "," "," ","NA"), number = c("1","2","3","1","4","1","2"))
alphabets1 alphabets2 alphabets3 number
1 A B C 1
2 B A F 2
3 C D G 3
4 B D 1
5 C 4
6 E 1
7 NA NA NA 2
NOTE1: within the row all the values are unique, that is, below shown is not possible.
alphabets1 alphabets2 alphabets3 number
1 A A C 1
NOTE2: data frame may contains NA or is blank
I am struggling to get the output below, which is just a data frame holding each letter and the sum of its corresponding numbers. For example, A appears in the 1st and 2nd rows, so its sum is 1+2, i.e. 3; B appears in the 1st, 2nd and 4th rows, so its sum is 1+2+1, i.e. 4.
output <-data.frame(alphabets1=c("A","B","C","D","E","F","G"), number = c("3","4","8","4","1","2","3"))
output
alphabets number
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
NOTE3: output may or may not have the NA or blanks (it doesn't matter!)
We can reshape it to 'long' format and do a group by operation
library(data.table)
melt(setDT(df), id.var="number", na.rm = TRUE, value.name = "alphabets1")[
!grepl("^\\s*$", alphabets1), .(number = sum(as.integer(as.character(number)))),
alphabets1]
# alphabets1 number
#1: A 3
#2: B 4
#3: C 8
#4: D 4
#5: E 1
#6: F 2
#7: G 3
Or we can use xtabs from base R
xtabs(number~alphabets1, data.frame(alphabets1 = unlist(df[-4]),
number = as.numeric(as.character(df[,4]))))
NOTE: In the OP's dataset, the missing values were the string "NA" rather than real NA, and the 'number' column is a factor (which is why it is converted to integer before doing the sum).
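Both points in the note can be checked on toy vectors. The sketch below uses made-up values (f and x are illustrative names, not part of the OP's data) to show why a factor is converted through character first, and how the string "NA" differs from a real missing value.

```r
# A factor stores integer level codes; converting it directly returns those codes
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                 # level codes: 1 2 1
values <- as.numeric(as.character(f))   # actual values: 10 20 10

# The string "NA" is ordinary text, only a real NA counts as missing
x <- c("A", "NA", NA)
missing <- is.na(x)                     # FALSE FALSE TRUE

print(codes)
print(values)
print(missing)
```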
data
df <- data.frame(alphabets1=c("A","B","C","B","C"," ",NA),
alphabets2=c("B","A","D","D"," ","E",NA),
alphabets3=c("C","F","G"," "," "," ",NA),
number = c("1","2","3","1","4","1","2"))
Here is a base R method using sapply and table. I first converted df$number into a numeric. See data section below.
data.frame(table(sapply(df[-length(df)], function(i) rep(i, df$number))))
Var1 Freq
1 11
2 A 3
3 B 4
4 C 8
5 D 4
6 E 1
7 F 2
8 G 3
9 NA 6
To make the output a little bit nicer, we could wrap a few more functions and perform a subsetting within sapply.
data.frame(table(droplevels(unlist(sapply(df[-length(df)],
function(i) rep(i[i %in% LETTERS],
df$number[i %in% LETTERS])),
use.names=FALSE))))
Var1 Freq
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
It may be easier to do this afterward, though.
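As a rough sketch of what filtering afterward could look like: build the full table first, then keep only the rows whose label is a capital letter. The data frame below is a character-column copy of the OP's data with number already numeric, and the names tab and res are just for illustration.

```r
# Assumed copy of the OP's data, with 'number' already numeric
df <- data.frame(alphabets1 = c("A", "B", "C", "B", "C", " ", "NA"),
                 alphabets2 = c("B", "A", "D", "D", " ", "E", "NA"),
                 alphabets3 = c("C", "F", "G", " ", " ", " ", "NA"),
                 number = c(1, 2, 3, 1, 4, 1, 2),
                 stringsAsFactors = FALSE)

# Repeat every entry 'number' times, pool all columns, and count
tab <- data.frame(table(unlist(lapply(df[-length(df)],
                                      function(i) rep(i, df$number)))))

# Drop blanks and "NA" strings after the fact
res <- tab[tab$Var1 %in% LETTERS, ]
print(res)
```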
data
I ran
df$number <- as.numeric(as.character(df$number))  # go through character so the values, not the factor level codes, are kept
on the OP's data resulting in this.
df <-
structure(list(alphabets1 = structure(c(2L, 3L, 4L, 3L, 4L, 1L,
5L), .Label = c(" ", "A", "B", "C", "NA"), class = "factor"),
alphabets2 = structure(c(3L, 2L, 4L, 4L, 1L, 5L, 6L), .Label = c(" ",
"A", "B", "D", "E", "NA"), class = "factor"), alphabets3 = structure(c(2L,
3L, 4L, 1L, 1L, 1L, 5L), .Label = c(" ", "C", "F", "G", "NA"
), class = "factor"), number = c(1, 2, 3, 1, 4, 1, 2)), .Names = c("alphabets1",
"alphabets2", "alphabets3", "number"), row.names = c(NA, -7L), class = "data.frame")

How to combine two rows in R?

I would like to combine/sum two rows based on rownames to make one row in R. The best route might be to create a new row and sum the two rows together.
Example df:
A 1 3 4 6
B 3 2 7 9
C 6 8 1 2
D 3 2 8 9
Where A,B,C,D are rownames, I want to combine/sum two rows (A & C) into one to get:
A+C 7 11 5 8
B 3 2 7 9
D 3 2 8 9
Thank you.
aggregate to the rescue:
aggregate(df, list(Group=replace(rownames(df),rownames(df) %in% c("A","C"), "A&C")), sum)
# Group V2 V3 V4 V5
#1 A&C 7 11 5 8
#2 B 3 2 7 9
#3 D 3 2 8 9
You can replace the A row using the standard addition arithmetic operator, and then remove the C row with a logical statement.
df["A", ] <- df["A", ] + df["C", ]
df[rownames(df) != "C", ]
# V2 V3 V4 V5
# A 7 11 5 8
# B 3 2 7 9
# D 3 2 8 9
For more than two rows, you can use colSums() for the addition. This presumes the first value in nm is the one we are replacing/keeping.
nm <- c("A", "C")
df[nm[1], ] <- colSums(df[nm, ])
df[!rownames(df) %in% nm[-1], ]
I'll leave it up to you to change the row names. :)
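For completeness, one possible way to do the renaming is sketched here. It rebuilds the example data so it runs on its own, and the "A+C" label built with paste() is simply one naming choice, not something the answer prescribes.

```r
# Example data from the question, rebuilt so the sketch is self-contained
df <- data.frame(V2 = c(1L, 3L, 6L, 3L), V3 = c(3L, 2L, 8L, 2L),
                 V4 = c(4L, 7L, 1L, 8L), V5 = c(6L, 9L, 2L, 9L),
                 row.names = c("A", "B", "C", "D"))

nm <- c("A", "C")
df[nm[1], ] <- colSums(df[nm, ])          # put the sums into the first named row
res <- df[!rownames(df) %in% nm[-1], ]    # drop the other rows that were summed

# Rename the kept row to show what it now contains
rownames(res)[rownames(res) == nm[1]] <- paste(nm, collapse = "+")
print(res)
```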
Data:
df <- structure(list(V2 = c(1L, 3L, 6L, 3L), V3 = c(3L, 2L, 8L, 2L),
V4 = c(4L, 7L, 1L, 8L), V5 = c(6L, 9L, 2L, 9L)), .Names = c("V2",
"V3", "V4", "V5"), class = "data.frame", row.names = c("A", "B",
"C", "D"))
matrix multiply?
> X <- as.matrix(df)
> A <- matrix(c(1,0,0,0,1,0,1,0,0,0,0,1), 3)
> A
[,1] [,2] [,3] [,4]
[1,] 1 0 1 0
[2,] 0 1 0 0
[3,] 0 0 0 1
> A %*% X
V2 V3 V4 V5
[1,] 7 11 5 8
[2,] 3 2 7 9
[3,] 3 2 8 9
Or using the Matrix package for sparse matrices:
library(Matrix)
fac2sparse(factor(c(1,2,1,4))) %*% X
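The 0/1 matrix does not have to be typed by hand. As a sketch in base R, it can be derived from a vector of group labels with outer(); grp is a hypothetical labelling of the four rows, and X is assumed to be the example data converted with as.matrix().

```r
# Example data from the question, as a numeric matrix
df <- data.frame(V2 = c(1L, 3L, 6L, 3L), V3 = c(3L, 2L, 8L, 2L),
                 V4 = c(4L, 7L, 1L, 8L), V5 = c(6L, 9L, 2L, 9L),
                 row.names = c("A", "B", "C", "D"))
X <- as.matrix(df)

grp <- c("A+C", "B", "A+C", "D")          # hypothetical group label for each row of X
A <- outer(unique(grp), grp, "==") * 1    # 3 x 4 indicator matrix: 1 where row j belongs to group i

res <- A %*% X                            # each output row is the sum of its group's rows
rownames(res) <- unique(grp)
print(res)
```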

Fill a column's blank spaces contingent on a second column in R

I'd appreciate some help with this one. I have something similar to the data below.
df$A df$B
1 .
1 .
1 .
1 6
2 .
2 .
2 7
What I need to do is fill in df$B with each value that corresponds to the end of the run of values in df$A. Example below.
df$A df$B
1 6
1 6
1 6
1 6
2 7
2 7
2 7
Any help would be welcome.
It seems to me that the missing values are denoted by a dot (.). It is better to read the dataset with na.strings="." so that the missing values will be NA. For the current dataset, the 'B' column would be of character/factor class, depending upon whether you used stringsAsFactors=FALSE/TRUE in read.table/read.csv (TRUE was the default before R 4.0.0).
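A minimal sketch of the na.strings idea, assuming the data arrives as whitespace-separated text (here supplied inline through the text argument instead of a file): the dots become NA on import, so 'B' is numeric from the start and can be filled with the last value of each group without any coercion warning.

```r
# Read the example with "." treated as missing
df1 <- read.table(text = "A B
1 .
1 .
1 .
1 6
2 .
2 .
2 7", header = TRUE, na.strings = ".")

str(df1)   # B is read as a numeric column containing NA

# Fill each group of A with the last value of B in that group
df1$B <- ave(df1$B, df1$A, FUN = function(x) x[length(x)])
print(df1)
```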
Using data.table, we convert the data.frame to data.table (setDT(df1)), change the 'character' class to 'numeric' (B:= as.numeric(B)). This will also result in coercing the . to NA (a warning will appear). Grouped by "A", we change the "B" values to the last element (B:= B[.N])
library(data.table)
setDT(df1)[,B:= as.numeric(B)][,B:=B[.N] , by = A]
# A B
#1: 1 6
#2: 1 6
#3: 1 6
#4: 1 6
#5: 2 7
#6: 2 7
#7: 2 7
Or with dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(B= as.numeric(tail(B,1)))
Or using ave from base R
df1$B <- with(df1, as.numeric(ave(B, A, FUN=function(x) tail(x,1))))
data
df1 <- structure(list(A = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), B = c(".",
".", ".", "6", ".", ".", "7")), .Names = c("A", "B"),
class = "data.frame", row.names = c(NA, -7L))

Manipulating a Data frame in R

I am new in R. I have data frame
A 5 8 9 6
B 8 2 3 6
C 1 8 9 5
I want to make
A 5
A 8
A 9
A 6
B 8
B 2
B 3
B 6
C 1
C 8
C 9
C 5
I have a big data file
Assuming you're starting with something like this:
mydf <- structure(list(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
V5 = c(6L, 6L, 5L)),
.Names = c("V1", "V2", "V3", "V4", "V5"),
class = "data.frame", row.names = c(NA, -3L))
mydf
# V1 V2 V3 V4 V5
# 1 A 5 8 9 6
# 2 B 8 2 3 6
# 3 C 1 8 9 5
Try one of the following:
library(reshape2)
melt(mydf, 1)
Or
cbind(mydf[1], stack(mydf[-1]))
Or
library(splitstackshape)
merged.stack(mydf, var.stubs = "V[2-5]", sep = "var.stubs")
The name pattern in the last example is unlikely to be applicable to your actual data though.
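One detail worth checking: stack() returns the values column by column (all of V2, then all of V3, and so on), whereas the desired output in the question lists them row by row. A possible fix, sketched below under the assumption that the row-wise order matters, is to sort the stacked result by V1; order() keeps ties in their original order, so the values within each letter stay in column order.

```r
# Same example data as above, rebuilt so the sketch runs on its own
mydf <- data.frame(V1 = c("A", "B", "C"), V2 = c(5L, 8L, 1L),
                   V3 = c(8L, 2L, 8L), V4 = c(9L, 3L, 9L),
                   V5 = c(6L, 6L, 5L), stringsAsFactors = FALSE)

# Stack the value columns and repeat the labels once per stacked column
long <- data.frame(V1 = rep(mydf$V1, times = ncol(mydf) - 1),
                   stack(mydf[-1]))

# Sort by label to get A, A, A, A, B, ... as in the question
out <- long[order(long$V1), c("V1", "values")]
rownames(out) <- NULL
print(out)
```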
Someone could probably do this in a better way but here I go...
I put your data into a data frame called data
#repeat each value in the first column (c - 1) times, where c is the number of columns (length(data[1,]))
rep(data[,1], each=length(data[1,])-1)
#turning your data frame into a matrix allows you then turn it into a vector...
#transpose the matrix because the vector concatenates columns rather than rows
as.vector(t(as.matrix(data[,2:5])))
#combining these ideas you get...
data.frame(col1=rep(data[,1], each=length(data[1,])-1),
col2=as.vector(t(as.matrix(data[,2:5]))))
If you could use a matrix you can just 'cast' it to a vector and add the row names. I have assumed that you really want 'a', 'b', 'c' as row names.
n <- 3;
data <- matrix(1:9, ncol = n);
data <- t(t(as.vector(data)));
rownames(data) <- rep(letters[1:3], each = n);
If you want to keep the rownames from your first data frame, this is of course also possible without libraries.
n <- 3;
data <- matrix(1:9, ncol=n);
names <- rownames(data);
data <- t(t(as.vector(data)))
rownames(data) <- rep(names, each = n)

R: collapse rows and then convert row into a new column

So here is my challenge. I am trying to get rid of rows of data that are best organized as a column. The original data set looks like
1|1|a
2|3|b
2|5|c
1|4|d
1|2|e
10|10|f
And the end result desired is
1 |1,2,4 |a| e d
2 |3,5 |b| c
10|10 |f| NA
The table is shaped by the minimum value of Col 2 within each grouping of Col 1: new column 3 holds the Col 3 value from the row with the group's minimum, and new column 4 collapses the Col 3 values from the remaining rows. Some of the approaches tried include:
newTable[min(newTable[,(1%o%2)]),] ## returns the minimum of both COL 1 and 2 only
ddply(newTable,"V1", summarize, newCol = paste(V7,collapse = " ")) ## collapses all values by Col 1 and creates a new column nicely.
Variations that combine these lines of code into a single line have not worked, in part due to my limited knowledge. These modifications are not included here.
Try:
library(dplyr)
library(tidyr)
dat %>%
group_by(V1) %>%
summarise_each(funs(paste(sort(.), collapse=","))) %>%
extract(V3, c("V3", "V4"), "(.),?(.*)")
gives the output
# V1 V2 V3 V4
#1 1 1,2,4 a d,e
#2 2 3,5 b c
#3 10 10 f
Or using aggregate and str_split_fixed
res1 <- aggregate(.~ V1, data=dat, FUN=function(x) paste(sort(x), collapse=","))
library(stringr)
res1[, paste0("V", 3:4)] <- as.data.frame(str_split_fixed(res1$V3, ",", 2),
stringsAsFactors=FALSE)
If you need NA for missing values
res1[res1==''] <- NA
res1
# V1 V2 V3 V4
#1 1 1,2,4 a d,e
#2 2 3,5 b c
#3 10 10 f <NA>
data
dat <- structure(list(V1 = c(1L, 2L, 2L, 1L, 1L, 10L), V2 = c(1L, 3L,
5L, 4L, 2L, 10L), V3 = c("a", "b", "c", "d", "e", "f")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -6L))
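The rule as described in the question (the row with the smallest V2 in each group supplies V3, the remaining rows are collapsed into V4) can also be sketched in base R. This is only one way to read that rule: it orders by V2 instead of alphabetically, so group 1 yields "e d", and a group with a single row gets NA.

```r
# Example data from the question
dat <- data.frame(V1 = c(1L, 2L, 2L, 1L, 1L, 10L),
                  V2 = c(1L, 3L, 5L, 4L, 2L, 10L),
                  V3 = c("a", "b", "c", "d", "e", "f"),
                  stringsAsFactors = FALSE)

# Sort so the minimum V2 comes first within every V1 group
dat <- dat[order(dat$V1, dat$V2), ]

# Build one output row per group
res <- do.call(rbind, lapply(split(dat, dat$V1), function(d) {
  data.frame(V1 = d$V1[1],
             V2 = paste(d$V2, collapse = ","),
             V3 = d$V3[1],                                    # value at the group minimum
             V4 = if (nrow(d) > 1) paste(d$V3[-1], collapse = " ") else NA_character_,
             stringsAsFactors = FALSE)
}))
rownames(res) <- NULL
print(res)
```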
Here's an approach using data.table, with data from @akrun's post:
It might be useful to store the columns as list instead of pasting them together.
require(data.table) ## 1.9.2+
setDT(dat)[order(V1, V2), list(V2=list(V2), V3=V3[1L], V4=list(V3[-1L])), by=V1]
# V1 V2 V3 V4
# 1: 1 1,2,4 a e,d
# 2: 2 3,5 b c
# 3: 10 10 f
setDT(dat) converts the data.frame to data.table, by reference (without copying it). Then, we sort it by columns V1,V2 and group by V1 column on the sorted data, and for each group, we create the columns V2, V3 and V4 as shown.
V2 and V4 will be of type list here. If you'd rather have a character column where all entries are pasted together, just replace list(.) with paste(., collapse=...).
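When pasting the entries of a single vector into one string, it is the collapse argument of paste() that joins them; sep only separates the arguments passed alongside each other. A tiny illustration on a made-up vector:

```r
x <- c(1, 2, 4)

with_sep      <- paste(x, sep = ",")        # still three strings: "1" "2" "4"
with_collapse <- paste(x, collapse = ",")   # one string: "1,2,4"

print(with_sep)
print(with_collapse)
```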
HTH
