Counting unique values across a row - r

I want to check that columns are consistent for each ID number (they're supposed to be constants, but there may be some doubt in the data, so I want to double check)
For example, given the following data frame:
test <- data.frame(ID = c("one","two","three"),
a = c(1,1,1),
b = c(1,1,1),
t = c(NA,1,1),
d = c(2,4,1))
I want to check that columns a,b,c and d are all the same, disregarding missing values. I thought I could do this by counting the unique values in the relevant columns, so then I can select only the rows where the number of unique values is more than 1... I imagine this is likely not the best way of doing that, but it was the only way I could think with my limited knowledge.
I found this question here, which seems to be similar to what I want to do:
Find unique values across a row of a data frame
But I am struggling to apply the answers to my data. I have tried this, which didn't do anything (but I've never used a for-loop before, so I've probably done that wrong), although when I run the inside of the function on it's own for a single row it does exactly what I hope for:
yeartest <- function(x){
temp <- test[x,2:5]
temp <- as.numeric(temp)
veclength <- length(unique(temp[!is.na(temp)]))
temp2 <- c(temp,veclength)
test[,"thing"] <- NA
test[x,2:6] <- temp2
}
for(i in 1:nrow(test)){
yeartest(i)
}
Then I tried from the accepted answer, to apply that:
x <- test
# dups <- function(x) x[!duplicated(x)]
yeartest <- function(x){
# x <- 1
temp <- test[x,2:5]
temp <- as.numeric(temp)
veclength <- length(unique(temp[!is.na(temp)]))
temp2 <- c(temp,veclength)
test[,"thing"] <- NA
test[x,2:6] <- temp2
}
new.df <- t(apply(x, 1, function(x) yeartest(x)))
Which gives an error and so it is pretty obvious that I have made a mistake in my translation of the answer to my data.
Apologies, this must be a really obvious failing on my part, I am very grateful for any help.
Solution: (thank you for the help!)
test$new <- apply(test[,2:5],1,function(r) length(unique(na.omit(r))))

> df <- data.frame(
a=sample(2,10,replace=TRUE),
b=sample(2,10,replace=TRUE),
c=sample(c("a","b"),10,replace=TRUE),
d=sample(c("a","b"),10,replace=TRUE))
> df[c(3,6,8),1] <- NA
> df
a b c d
1 1 2 a b
2 1 2 a b
3 NA 2 a a
4 2 2 a b
5 1 2 a a
6 NA 1 a b
7 2 1 b b
8 NA 1 a a
9 1 1 b b
10 2 2 b b
> apply(df,1,function(r) length(unique(na.omit(r))))
[1] 3 3 2 4 3 2 4 2 3 3

Related

Combine/match/merge vectors by row names

I have a vector of variable names and several matrices with single rows.
I want to create a new matrix. The new matrix is created by match/merge the row names of the matrices with single rows.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several matrices with single rows
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_1) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but just to exactly obtain the desired output I want:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5

Extract only first line in a data frame from several subgroups that satisfy a conditional

I have a data frame similar to the dummy example here:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
In the original data frame, there are many more groups, each with 10 values. For each group (a,b or c) I would like to extract the first line where value!=NA, but only the first line where this is true. As in a group there could be several values different from NA and from each other I can't simply subset.
I was imagining something like this using plyr and a conditional, but I honestly have no idea what the conditional should take:
ddply<-(df,.(Group),function(sub_data){
for(i in 1:length(sub_data$value)){
if(sub_data$Value!='NA'){'take value but only for the first non NA')
return(first line that satisfies)
})
Maybe this is easy with other strategies that I don't know of
Any suggestion is very much appreciated!
I know this has been answered but for this you should be looking at the data.table package. It provides a very expressive and terse syntax for doing what you ask:
df<-data.table(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
> df[ Value != "NA", .SD[1], by=Group ]
Group Value
1: a 10
2: b 4
3: c 2
Do youself a favor and learn data.table
Some other notes:
You can easily convert data.frames to data.tables
I think that you don't want "NA" but simply NA in your example, in that case the syntax is:
df[ ! is.na(Value), .SD[1], by=Group ]
Since you suggested plyr in the first place:
ddply(subset(df, !is.na(Value)), .(Group), head, 1L)
That assumes you have NAs and not 'NA's. If the latter (not recommended), then:
ddply(subset(df, Value != 'NA'), .(Group), head, 1L)
Note how concise this is. I would agree with using plyr.
If you're willing to use actual NA's vs strings, then the following should give you what you're looking for:
df <- (Group=rep(letters[1:3], each=3),
Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
print(df)
## Group Value
## 1 a <NA>
## 2 a <NA>
## 3 a 10
## 4 b <NA>
## 5 b 4
## 6 b 8
## 7 c <NA>
## 8 c <NA>
## 9 c 2
df.1 <- by(df, df$Group, function(x) {
head(x[complete.cases(x),], 1)
})
print(df.1)
## df$Group: a
## Group Value
## 3 a 10
## ------------------------------------------------------------------------
## df$Group: b
## Group Value
## 5 b 4
## ------------------------------------------------------------------------
## df$Group: c
## Group Value
## 9 c 2
First you should take care of NA's:
options(stringsAsFactors=FALSE)
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
And then maybe something like this would do the trick:
for(i in unique(df$Group)) {
for(j in df$Value[df$Group==i]) {
if(!is.na(j)) {
print(paste(i,j))
break
}
}
}
Assuming that Value is actually numeric, not character.
> df <- data.frame(Group=rep(letters[1:3],each=3),
Value=c(NA, NA, 10, NA, 4, 8, NA, NA, 2)
> do.call(rbind, lapply(split(df, df$Group), function(x){
x[ is.na(x[,2]) == FALSE, ][1,]
}))
## Group Value
## a a 10
## b b 4
## c c 2
I don't see any solutions using aggregate(...), which would be the simplest:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
aggregate(Value~Group,df[df$Value!="NA",],head,1)
# Group Value
# 1 a 10
# 2 b 4
# 3 c 2
If your df contains actual NA, and not "NA" as in your example, then use this:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
aggregate(Value~Group,df[!is.na(df$Value),],head,1)
Group Value
1 a 10
2 b 4
3 c 2
Your life would be easier if you marked missing values with NA and not as a character string 'NA'; the former is really missing to R and it has tools to work with such missingness. The latter ('NA') is really not missing except for the meaning that this string has to you alone; R cannot divine that information directly. Assuming you correct this, then the solution below is one way to go about doing this.
Similar in spirit to #hrbrmstr's by() but to my eyes aggregate() gives nicer output:
> foo <- function(x) head(x[complete.cases(x)], 1)
> aggregate(Value ~ Group, data = df, foo)
Group Value
1 a 10
2 b 4
3 c 2
> aggregate(df$Value, list(Group = df$Group), foo)
Group x
1 a 10
2 b 4
3 c 2

R: How can I sum across variables, within cases, while counting NA as zero

Fake data for illustration:
df <- data.frame(a=c(1,2,3,4,5), b=(c(2,2,2,2,NA)),
c=c(NA,2,3,4,5)))
This would get me the answer I want IF it weren't for the NA values:
df$count <- with(df, (a==1) + (b==2) + (c==3))
Also, would there be an even more elegant way if I was only interested in, e.g. variables==2?
df$count <- with(df, (a==2) + (b==2) + (c==2))
Many thanks!
The following works for your specific example, but I have a suspicion that your real use case is more complicated:
df$count <- apply(df,1,function(x){sum(x == 1:3,na.rm = TRUE)})
> df
a b c count
1 1 2 NA 2
2 2 2 2 1
3 3 2 3 2
4 4 2 4 1
5 5 NA 5 0
but this general approach should work. For instance, your second example would be something like this:
df$count <- apply(df,1,function(x){sum(x == 2,na.rm = TRUE)})
or more generally you could allow yourself to pass in a variable for the comparison:
df$count <- apply(df,1,function(x,compare){sum(x == compare,na.rm = TRUE)},compare = 1:3)
Another way is to subtract your target vector from each row of your data.frame, negate and then do rowSums with na.rm=TRUE:
target <- 1:3
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 2 1 2 1 0
target <- rep(2,3)
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 1 3 1 1 0

Replace values in selected columns by passing column name of data.frame into apply() or plyr function

Suppose I have a date.frame like:
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
a b c
1 1 4 5
2 2 3 4
3 3 5 3
4 4 2 2
5 5 1 1
and I need to replace all the 5 as NA in column b & c then return to df:
df
a b c
1 1 4 NA
2 2 3 4
3 3 NA 3
4 4 2 2
5 5 1 1
But I want to do a generic apply() function instead of using replace() each by each because there are actually many variables need to be replaced in the real data. Suppose I've defined a variable list:
var <- c("b", "c")
and come up with something like:
df <- within(df, sapply(var, function(x) x <- replace(x, x==5, NA)))
but nothing happens. I was thinking if there is a way to work this out with something similar to the above by passing a variable list of column names from a data.frame into a generic apply / plyr function (or maybe some other completely different ways). Thanks~
You could just do
df[,var][df[,var] == 5] <- NA
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
var <- c("b","c")
df[,var] <- sapply(df[,var],function(x) ifelse(x==5,NA,x))
df
I find the ifelse notation easier to understand here, but most Rers would probably use indexing instead.

How do I take subsets of a data frame according to a grouping in R?

I have an aggregation problem which I cannot figure out how to perform efficiently in R.
Say I have the following data:
group1 <- c("a","b","a","a","b","c","c","c","c",
"c","a","a","a","b","b","b","b")
group2 <- c(1,2,3,4,1,3,5,6,5,4,1,2,3,4,3,2,1)
value <- c("apple","pear","orange","apple",
"banana","durian","lemon","lime",
"raspberry","durian","peach","nectarine",
"banana","lemon","guava","blackberry","grape")
df <- data.frame(group1,group2,value)
I am interested in sampling from the data frame df such that I randomly pick only a single row from each combination of factors group1 and group2.
As you can see, the results of table(df$group1,df$group2)
1 2 3 4 5 6
a 2 1 2 1 0 0
b 2 2 1 1 0 0
c 0 0 1 1 2 1
shows that some combinations are seen more than once, while others are never seen. For those that are seen more than once (e.g., group1="a" and group2=3), I want to randomly pick only one of the corresponding rows and return a new data frame that has only that subset of rows. That way, each possible combination of the grouping factors is represented by only a single row in the data frame.
One important aspect here is that my actual data sets can contain anywhere from 500,000 rows to >2,000,000 rows, so it is important to be mindful of performance.
I am relatively new at R, so I have been having trouble figuring out how to generate this structure correctly. One attempt looked like this (using the plyr package):
choice <- function(x,label) {
cbind(x[sample(1:nrow(x),1),],data.frame(state=label))
}
df <- ddply(df[,c("group1","group2","value")],
.(group1,group2),
pick_junc,
label="test")
Note that in this case, I am also adding an extra column to the data frame called "label" which is specified as an extra argument to the ddply function. However, I killed this after about 20 min.
In other cases, I have tried using aggregate or by or tapply, but I never know exactly what the specified function is getting, what it should return, or what to do with the result (especially for by).
I am trying to switch from python to R for exploratory data analysis, but this type of aggregation is crucial for me. In python, I can perform these operations very rapidly, but it is inconvenient as I have to generate a separate script/data structure for each different type of aggregation I want to perform.
I want to love R, so please help! Thanks!
Uri
Here is the plyr solution
set.seed(1234)
ddply(df, .(group1, group2), summarize,
value = value[sample(length(value), 1)])
This gives us
group1 group2 value
1 a 1 apple
2 a 2 nectarine
3 a 3 banana
4 a 4 apple
5 b 1 grape
6 b 2 blackberry
7 b 3 guava
8 b 4 lemon
9 c 3 durian
10 c 4 durian
11 c 5 raspberry
12 c 6 lime
EDIT. With a data frame that big, you are better off using data.table
library(data.table)
dt = data.table(df)
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
EDIT 2: Performance Comparison: Data Table is ~ 15 X faster
group1 = sample(letters, 1000000, replace = T)
group2 = sample(LETTERS, 1000000, replace = T)
value = runif(1000000, 0, 1)
df = data.frame(group1, group2, value)
dt = data.table(df)
f1_dtab = function() {
dt[,list(value = value[sample(length(value), 1)]),'group1, group2']
}
f2_plyr = function() {ddply(df, .(group1, group2), summarize, value =
value[sample(length(value), 1)])
}
f3_by = function() {do.call(rbind,by(df,list(grp1 = df$group1,grp2 = df$group2),
FUN = function(x){x[sample(nrow(x),1),]}))
}
library(rbenchmark)
benchmark(f1_dtab(), f2_plyr(), f3_by(), replications = 10)
test replications elapsed relative
f1_dtab() 10 4.764 1.00000
f2_plyr() 10 68.261 14.32851
f3_by() 10 67.369 14.14127
One more way:
with(df, tapply(value, list( group1, group2), length))
1 2 3 4 5 6
a 2 1 2 1 NA NA
b 2 2 1 1 NA NA
c NA NA 1 1 2 1
# Now use tapply to sample withing groups
# `resample` fn is from the sample help page:
# Avoids an error with sample when only one value in a group.
resample <- function(x, ...) x[sample.int(length(x), ...)]
#Create a row index
df$idx <- 1:NROW(df)
rowidxs <- with(df, unique( c( # the `c` function will make a matrix into a vector
tapply(idx, list( group1, group2),
function (x) resample(x, 1) ))))
rowidxs
# [1] 1 5 NA 12 16 NA 3 15 6 4 14 10 NA NA 7 NA NA 8
df[rowidxs[!is.na(rowidxs)] , ]

Resources