I have a data.table with about 50,000 rows and two columns. The second column contains many "NA" and "/NA" entries.
Example:
V1 V2
A 1
B 2
A 1
C 3
A NA
B 2
C 3
A /NA
B /NA
A 1
I want to get
V1 V2
A 1
B 2
A 1
C 3
A 1
B 2
C 3
A 1
B 2
A 1
How can I do this?
Thank you so much, Justin
tf <- tempfile()
writeLines(" V1 V2
A A
B B
A A
C C
A NA
B B
C C
A /NA
B /NA
A A", tf )
x <- read.table(tf, header = TRUE, stringsAsFactors = FALSE)
x$V2 <- ifelse(gsub("[/]","", x$V2) == "NA" | is.na(x$V2), x$V1, x$V2)
R> x
V1 V2
1 A A
2 B B
3 A A
4 C C
5 A A
6 B B
7 C C
8 A A
9 B B
10 A A
edit
A second ifelse() clause (or switch) is needed for the updated question, to map V1 to V2. Note that I've inverted the initial condition with !:
x$V2 <- ifelse(!(gsub("[/]","", x$V2) == "NA" | is.na(x$V2)), x$V2,
ifelse(x$V1 == "A", 1, ifelse(x$V1 == "B", 2,3)))
You can also use a plain data frame in base R to get the same result:
example <- data.frame(V1 = c("A","B","A","C","A","B","C","A","B","A"),
V2=c(1,2,1,3,"NA",2,3,"/NA","/NA",1), stringsAsFactors = FALSE)
example <- within(example, V2[V1=="A" & (V2=="NA" | V2=="/NA")] <-1)
example <- within(example, V2[V1=="B" & (V2=="NA" | V2=="/NA")] <-2)
example <- within(example, V2[V1=="C" & (V2=="NA" | V2=="/NA")] <-3)
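The answers above hard-code the mapping A = 1, B = 2, C = 3. As a hedged alternative, the mapping could be derived from the rows that are already filled in. This is only a sketch: the names missing, known and lookup are made up for illustration, and it assumes every V1 key appears at least once with a valid V2.

```r
example <- data.frame(
  V1 = c("A", "B", "A", "C", "A", "B", "C", "A", "B", "A"),
  V2 = c(1, 2, 1, 3, "NA", 2, 3, "/NA", "/NA", 1),
  stringsAsFactors = FALSE
)

# Rows whose V2 is the string "NA", the string "/NA", or a real NA
missing <- example$V2 %in% c("NA", "/NA") | is.na(example$V2)

# Build a named lookup vector (V1 -> V2) from the rows that are filled in
known  <- unique(example[!missing, ])
lookup <- setNames(known$V2, known$V1)

# Fill the gaps from the lookup, then convert the column to numeric
example$V2[missing] <- lookup[example$V1[missing]]
example$V2 <- as.numeric(example$V2)
example
```

With 50,000 rows this stays vectorised, and new keys need no extra code as long as each has at least one known value.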
I have a list of data tables stored in an object ddf (a sample is shown below):
[[43]]
V1 V2 V3
1: b c a
2: b c a
3: b c a
4: b c a
5: b b a
6: b c a
7: b c a
[[44]]
V1 V2 V3
1: a c a
2: a c a
3: a c a
4: a c a
5: a c a
[[45]]
V1 V2 V3
1: a c b
2: a c b
3: a c b
4: a c b
5: a c b
6: a c b
7: a c b
8: a c b
9: a c b
.............and so on till [[100]]
I want to subset the list ddf so that the result contains only those data tables which:
have at least 9 rows each
have 9 rows that are all the same
I want to store this subsetted output.
I have written some code for this below:
for(i in 1:100){
m=(as.numeric(nrow(df[[i]]))>= 9)
if(m == TRUE & df[[i]][1,] = df[[i]][2,] =
=df[[i]][3,] =df[[i]][4,] =df[[i]][5,] =df[[i]][6,]=
df[[i]][7,]=df[[i]][8,]=df[[i]][9,]){
print(df[[i]])
}}
Please tell me what's wrong and how I can generalize the result to subset based on "n" identical rows.
[Follow-up Question]
Answer obtained from Main question:
> ddf[sapply(ddf, function(x) nrow(x) >= n & nrow(unique(x)) == 1)]
$`61`
V1 V2 V3
1: a c b
2: a c b
3: a c b
4: a c b
5: a c b
6: a c b
7: a c b
$`68`
V1 V2 V3
1: a c a
2: a c a
3: a c a
4: a c a
5: a c a
6: a c a
7: a c a
8: a c a
$`91`
V1 V2 V3
1: b c a
2: b c a
3: b c a
4: b c a
5: b c a
6: b c a
7: b c a
..... till the last data.frame which meet the row matching criteria (of at least 9 similar rows)
There are only 2 types of elements in the list:
**[[.. ]]**
**Case 1.** >70% accuracy
**Case 2.** <70% accuracy
Notice that the output shown above under "Follow-up Question" covers only $`61`, $`68` and $`91`; there is no output for the other data tables, which do not meet the matching-row criterion.
I need the elements that do not meet the criterion to return "bad output" instead.
Thus the final list should have the same length as the input list.
By placing them side by side using paste, I should be able to see each output.
We can loop through the list ('ddf'), keep only the duplicated rows (duplicated) and check how many remain: if the subset 'x1' has at least 9 rows, we order it and return the first 9 rows (head(.SD, 9)); otherwise we return the string "bad answer".
lapply(ddf, function(x) {
x1 <- x[duplicated(x)|duplicated(x, fromLast=TRUE)]
if(nrow(x1) >= 9) {
x1[order(V1, V2, V3), head(.SD, 9)]
} else "bad answer"
})
#[[1]]
# V1 V2 V3
#1: b c a
#2: b c a
#3: b c a
#4: b c a
#5: b c a
#6: b c a
#7: b c a
#8: b c a
#9: b c a
#[[2]]
#[1] "bad answer"
#[[3]]
#[1] "bad answer"
data
ddf <- list(data.table(V1 = 'b', V2 = rep(c('c', 'b', 'c'), c(8, 1, 2)), V3 = 'a'),
data.table(V1 = rep("a", 5), V2 = rep("c", 5), V3 = rep("a", 5)),
data.table(V1 = c('b', 'a', 'b', 'b'), V2 = c('b', 'a', 'c', 'b'),
V3 = c("c", "d", "a", "b")))
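If the marker should be applied with the same test as in the main question (at least n rows, all rows identical), the check can be reused inside lapply so that the result has the same length as the input. A minimal sketch, using plain data frames and the hypothetical names demo, n and res; nrow() and unique() are called the same way on data.tables.

```r
# Three small tables: only the first has at least 9 rows that are all identical
demo <- list(
  data.frame(V1 = "a", V2 = "c", V3 = rep("b", 9)),
  data.frame(V1 = "a", V2 = "c", V3 = rep("a", 5)),
  data.frame(V1 = c("b", "a", "b", "b"),
             V2 = c("b", "a", "c", "b"),
             V3 = c("c", "d", "a", "b"))
)

n <- 9  # required number of identical rows

# Return the table itself when it qualifies, otherwise a marker string
res <- lapply(demo, function(x) {
  if (nrow(x) >= n && nrow(unique(x)) == 1) x else "bad output"
})

length(res) == length(demo)  # the output keeps one element per input table
```

Because every input element maps to exactly one output element, positions in res line up with positions in the original list.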
When ddf is your list of datatables, then:
ddf[sapply(ddf, nrow) >= 9 & sapply(ddf, function(x) nrow(unique(x))) == 1]
should give you the desired result.
Where:
sapply(ddf, nrow) >= 9 checks whether each data.table has nine or more rows
sapply(ddf, function(x) nrow(unique(x))) == 1 checks whether all the rows are the same.
Or with one sapply call, as @docendodiscimus suggested:
ddf[sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1)]
Or by using the .N special symbol and the uniqueN function of data.table:
ddf[sapply(ddf, function(x) x[,.N] >= 9 & uniqueN(x) == 1)]
Another option is to use Filter (following the suggestion of @Frank in the comments):
Filter(function(x) nrow(x) >= 9 & uniqueN(x) == 1, ddf)
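To generalize from 9 to any "n", as the question asks, the threshold can be turned into a function argument. The following is only a sketch: keep_uniform and tables are hypothetical names, and plain data frames are used so that it runs without extra packages (with data.tables, uniqueN(x) could replace nrow(unique(x))).

```r
# Keep only the list elements with at least n rows, all of them identical
keep_uniform <- function(lst, n) {
  Filter(function(x) nrow(x) >= n && nrow(unique(x)) == 1, lst)
}

tables <- list(
  dt1 = data.frame(V1 = "a", V2 = "c", V3 = rep("b", 9)),           # 9 identical rows
  dt2 = data.frame(V1 = "a", V2 = "c", V3 = rep("a", 5)),           # 5 identical rows
  dt3 = data.frame(V1 = rep(c("a", "b"), 5), V2 = "c", V3 = "a")    # 10 rows, 2 distinct
)

names(keep_uniform(tables, 9))  # "dt1"
names(keep_uniform(tables, 5))  # "dt1" "dt2"
```

Inside the function, && is used instead of & because each test yields a single TRUE or FALSE per table.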
Two approaches to get the datatable numbers:
1. Using which:
which(sapply(ddf, function(x) nrow(x) >= 9 & nrow(unique(x)) == 1))
2. Assign names to the datatables in the list:
names(ddf) <- paste0('dt', seq_along(ddf))
Now the output will include the data.table name:
$dt4
V1 V2 V3
1 a c b
2 a c b
3 a c b
4 a c b
5 a c b
6 a c b
7 a c b
8 a c b
9 a c b
I want to add a field to a data frame and set it equal to an existing field in the same data frame, based on a condition on a different (existing) field.
I know this works:
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
df$test[is.even(df$a)] <- as.character(df[is.even(df$a), "b"])
> df
a b test
1 1 A NA
2 2 B B
3 3 C NA
4 4 D D
5 5 E NA
6 6 F F
But I have this feeling it can be done a lot better than this.
Using data.table it's quite easy
library(data.table)
dt = data.table(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
dt[is.even(a), test := b]
> dt
a b test
1: 1 A NA
2: 2 B B
3: 3 C NA
4: 4 D D
5: 5 E NA
6: 6 F F
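For comparison, a base R version without data.table is possible with ifelse. This is a minimal sketch that assumes column b is stored as character (hence stringsAsFactors = FALSE); with a factor column, as.character(df$b) would be needed as in the question.

```r
is.even <- function(x) x %% 2 == 0

df <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                 b = c("A", "B", "C", "D", "E", "F"),
                 stringsAsFactors = FALSE)

# Copy b where a is even, otherwise leave a character NA
df$test <- ifelse(is.even(df$a), df$b, NA_character_)
df
```

The condition is evaluated once here, whereas the original assignment computes is.even(df$a) on both sides.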
I have a data frame with 8 columns. I must keep columns 1 to 3 unconditionally. Of columns 4 to 8, I need to keep only those with exactly 3 levels and drop the ones that do not meet that condition.
I tried the following command
data3 <- data2[,sapply(data2,function(col)length(unique(col)))==3]
It managed to retain the variables with 3 levels, but deleted my first 3 columns.
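You could do a two-step process:
# Step 1: the first three columns, kept unconditionally
data4 <- data2[1:3]
# Step 2: your filter, applied to columns 4 to 8 only
rest <- data2[4:8]
data3 <- rest[, sapply(rest, function(col) length(unique(col))) == 3, drop = FALSE]
cbind(data4, data3)
Note that cbind() is used rather than merge(): the two pieces share no key column, so merge() would return their Cartesian product instead of pairing them row by row.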
I would suggest another approach:
x = 1:3
cbind(data2[x], Filter(function(i) length(unique(i))==3, data2[-x]))
# 1 2 3 5
#1 a 1 3 b
#2 b 2 4 b
#3 c 3 5 b
#4 d 4 6 a
#5 e 5 7 c
#6 f 6 8 c
#7 g 7 9 c
#8 h 8 10 a
#9 i 9 11 c
#10 j 10 12 b
Data:
data2 = setNames(
data.frame(letters[1:10],
1:10,
3:12,
sample(letters[1:10],10, replace=TRUE),
sample(letters[1:3],10, replace=TRUE)),
1:5)
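Since the data above is random, the result changes from run to run. A deterministic sketch of the same idea, with made-up column names (id, x, y, f1, f2, f3) and a hypothetical data frame demo, could look like this:

```r
demo <- data.frame(
  id = 1:6,
  x  = 6:1,
  y  = letters[1:6],
  f1 = c("a", "b", "c", "a", "b", "c"),  # 3 distinct values -> kept
  f2 = c("a", "b", "a", "b", "a", "b"),  # 2 distinct values -> dropped
  f3 = c("u", "v", "w", "w", "v", "u"),  # 3 distinct values -> kept
  stringsAsFactors = FALSE
)

keep <- 1:3  # columns that are always retained

# Count distinct values only in the remaining columns
n_distinct <- sapply(demo[-keep], function(col) length(unique(col)))

# Bind the fixed columns to the filtered remainder
result <- cbind(demo[keep], demo[-keep][n_distinct == 3])
names(result)  # "id" "x" "y" "f1" "f3"
```

Restricting the count to demo[-keep] is what prevents the first three columns from being dropped.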
Assuming that the columns 4:8 are factor class, we can also use nlevels to filter the columns. We create 'toKeep' as the numeric index of columns to keep, and 'toFilter' as numeric index of columns to filter. We subset the dataset into two: 1) using the 'toKeep' as the index (data2[toKeep]), 2) using the 'toFilter', we further subset the dataset by looping with sapply to find the number of levels (nlevels), create logical index (==3) to filter the columns and cbind with the first subset.
toKeep <- 1:3
toFilter <- setdiff(seq_len(ncol(data2)), toKeep)
cbind(data2[toKeep], data2[toFilter][sapply(data2[toFilter], nlevels)==3])
# V1 V2 V3 V4 V6
#1 B B D C B
#2 B D D A B
#3 D E B A B
#4 C B E C A
#5 D D A D E
#6 E B A A B
data
set.seed(24)
data2 <- as.data.frame(matrix(sample(LETTERS[1:5], 8*6, replace=TRUE), ncol=8), stringsAsFactors = TRUE) # factor columns, which nlevels() needs (R >= 4.0 defaults to character)
I have a data frame x with 2 character columns:
x <- data.frame(a = numeric(), b = I(list()))
x[1:3,"a"] = 1:3
x[[1, "b"]] <- "a, b, c"
x[[2, "b"]] <- "d, e"
x[[3, "b"]] <- "f"
x$a = as.character(x$a)
x$b = as.character(x$b)
x
str(x)
The entries in column b are comma-separated strings of characters.
I need to produce this data frame:
1 a
1 b
1 c
2 d
2 e
3 f
I know how to do it when I loop row by row. But is it possible to do without looping?
Thank you!
Have you checked out require(splitstackshape)?
> cSplit(x, "b", ",", direction = "long")
a b
1: 1 a
2: 1 b
3: 1 c
4: 2 d
5: 2 e
6: 3 f
> s <- strsplit(as.character(x$b), ",\\s*")  # split on the comma and any following spaces
> data.frame(value=rep(x$a, sapply(s, FUN=length)),b=unlist(s))
value b
1 1 a
2 1 b
3 1 c
4 2 d
5 2 e
6 3 f
There you go, this should be very fast:
library(data.table)
x <- data.table(x)
x[, .(b = unlist(strsplit(b, ",\\s*"))), by = a]
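For readers without extra packages, the split-and-repeat idea can be written in base R only. This is a hedged sketch: the names parts and long are illustrative, and the input is rebuilt directly as a character data frame rather than through the list column used in the question.

```r
x <- data.frame(a = c("1", "2", "3"),
                b = c("a, b, c", "d, e", "f"),
                stringsAsFactors = FALSE)

# Split each string on a comma followed by optional whitespace
parts <- strsplit(x$b, ",\\s*")

# Repeat each key once per piece and flatten the pieces into one column
long <- data.frame(a = rep(x$a, lengths(parts)),
                   b = unlist(parts),
                   stringsAsFactors = FALSE)
long
```

lengths(parts) gives the number of pieces per row, which is what tells rep() how often to repeat each value of a.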
Is it possible to count the unique elements in each row of a data frame, return the one that occurs most often, and collect the results into a vector?
example:
a a a b b b b -> b
c v f w w r t -> w
s s d f b b b -> b
You can use apply to run the table function on every row of the data frame.
df <- read.table(textConnection("a a a b b b b\nc v f w w r t\ns s d f b b b"), header = FALSE)
df$result <- apply(df, 1, function(x) names(table(x))[which.max(table(x))])
df
## V1 V2 V3 V4 V5 V6 V7 result
## 1 a a a b b b b b
## 2 c v f w w r t w
## 3 s s d f b b b b
Yes, with table:
x=c("a", "a", "a", "b" ,"b" ,"b" ,"b")
table(x)
x
a b
3 4
EDIT with data.table
library(data.table)
DT = data.table(x=sample(letters[1:5],10,TRUE),y=sample(letters[1:5],10,TRUE))
#DT
# x y
# 1: d a
# 2: c d
# 3: d c
# 4: c a
# 5: a e
# 6: d c
# 7: c b
# 8: a b
# 9: b c
#10: c d
f = function(x) names(table(x))[which.max(table(x))]
DT[,lapply(.SD,f)]
# x y
#1: c c
Note that if you want to keep ALL max's, you need to ask for them explicitly.
You can save them as a list inside the data.frame. If there is only one per row, then the list will be simplified to a common vector
df$result <- apply(df, 1, function(x) {tab <- table(x); list(tab[which(tab==max(tab))])})
With Ties for max:
df2 <- df[, 1:6]
df2$result <- apply(df2, 1, function(x) {tab <- table(x); list(tab[which(tab==max(tab))])})
> df2
V1 V2 V3 V4 V5 V6 result
1 a a a b b b 3, 3
2 c v f w w r 2
3 s s d f b b 2, 2
With No Ties for max:
df$result <- apply(df, 1, function(x) {tab <- table(x); list(tab[which(tab==max(tab))])})
> df
V1 V2 V3 V4 V5 V6 V7 result
1 a a a b b b b 4
2 c v f w w r t 2
3 s s d f b b b 3
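The result column above stores the counts; if the tied values themselves are wanted, the names of the table can be returned instead. A minimal sketch, where row_modes, demo and modes are hypothetical names and ties are joined into one comma-separated string per row:

```r
demo <- read.table(text = "a a a b b b\nc v f w w r\ns s d f b b",
                   header = FALSE, colClasses = "character")

# All values that reach the maximum count in one row
row_modes <- function(x) {
  tab <- table(x)
  names(tab)[tab == max(tab)]
}

# lapply always returns a list, even when every row has the same number of modes
modes <- lapply(seq_len(nrow(demo)), function(i) row_modes(unlist(demo[i, ])))

# Collapse ties into a single string so the result fits in an ordinary column
demo$result <- vapply(modes, paste, character(1), collapse = ",")
demo$result  # "a,b" "w" "b,s"
```

lapply is used here instead of apply because apply may simplify its result to a matrix when all rows happen to have the same number of modes.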