R - Subset dataframe where 2 columns have values [duplicate] - r

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 6 years ago.
How can I subset a dataframe where 2 columns have values?
For example:
A B
1 2
3
5 6
8
becomes
A B
1 2
5 6

> subset(df, !is.na(df$A) & !is.na(df$B))
> df[!is.na(df$A) & !is.na(df$B),]
> df[!is.na(rowSums(df)),]
> na.omit(df)
all equivalent

One easiest way is to use na.omit (if you are targeting NA values).
Kindly go through following R code snippet:
> x
a b
1 1 2
2 3 NA
3 5 6
4 NA 8
> na.omit(x)
a b
1 1 2
3 5 6
Another way is to use complete.cases as shown below:
> x[complete.cases(x),]
a b
1 1 2
3 5 6
You can also use na.exclude as shown below:
> na.exclude(x)
a b
1 1 2
3 5 6
Hope it works for you!

Related

R Repeat Rows Data Table [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
library(data.table)
dataHAVE=data.frame("student"=c(1,2,3),
"score" = c(10,11,12),
"count"=c(4,1,2))
dataWANT=data.frame("student"=c(1,1,1,1,2,3,3),
"score"=c(10,10,10,10,11,12,12),
"count"=c(4,4,4,4,1,2,2))
setDT(dataHAVE)dataHAVE[rep(1:.N,count)][,Indx:=1:.N,by=student]
I have data 'dataHAVE' and seek to produce 'dataWANT' that basically copies each 'student' 'count' number of times as shown in 'dataWANT'. I try doing this as shown above in data.table as this is the solution I seek but get error
Error: unexpected symbol in "setDT(dat)dat"
and I cannot resolve thank you so much.
Try:
setDT(dataHAVE)[rep(1:.N,count)]
Output:
student score count
1: 1 10 4
2: 1 10 4
3: 1 10 4
4: 1 10 4
5: 2 11 1
6: 3 12 2
7: 3 12 2
As explained you could also replace 1:.N and do setDT(dataHAVE)[dataHAVE[, rep(.I, count)]].
Just FYI, there's also a nice function in tidyr that does similar thing:
tidyr::uncount(dataHAVE, count, .remove = FALSE)
Here is a base R solution
dataWANT<-do.call(rbind,
c(with(dataHAVE,rep(split(dataHAVE,student),count)),
make.row.names = FALSE))
such that
> dataWANT
student score count
1 1 10 4
2 1 10 4
3 1 10 4
4 1 10 4
5 2 11 1
6 3 12 2
7 3 12 2

Tallying values in single column and separating into Rows in R [duplicate]

This question already has answers here:
Counting the number of elements with the values of x in a vector
(20 answers)
Closed 6 years ago.
I have a single row of numbers. I'm wondering how I can separate it out so that it outputs columns that total the tally of each set of numbers. I've tried playing around with "separate" but I can't figure out how to make it work.
Here's my data frame:
2
2
2
2
2
4
4
4
I'd like it to be
2 4
5 3
You can use the table() function.
> df
V1
1 2
2 2
3 2
4 2
5 2
6 4
7 4
8 4
> table(df$V1)
2 4
5 3
We can use tabulate which would be faster
tabulate(factor(df1$V1))
#[1] 5 3

Most frequent value (mode) by group [duplicate]

This question already has answers here:
Create a variable capturing the most frequent occurence by group
(3 answers)
How to find the statistical mode?
(35 answers)
Closed 5 years ago.
I am trying to find the most frequent value by group. In the following example dataframe:
df<-data.frame(a=c(1,1,1,1,2,2,2,3,3),b=c(2,2,1,2,3,3,1,1,2))
> df
a b
1 1 2
2 1 2
3 1 1
4 1 2
5 2 3
6 2 3
7 2 1
8 3 1
9 3 2
I would like to add a column 'c' which has the most occurring value in 'b' when its values are grouped by 'a'. I would like the following output:
> df
a b c
1 1 2 2
2 1 2 2
3 1 1 2
4 1 2 2
5 2 3 3
6 2 3 3
7 2 1 3
8 3 1 1
9 3 2 1
I tried using table and tapply but didn't get it right. Is there a fast way to do that?
Thanks!
Building on Davids comments your solution is the following:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
df %>% group_by(a) %>% mutate(c=Mode(b))
Notice though that for the tie when df$a is 3 then the mode for b is 1.
We could get the 'Mode' of 'b' grouped by 'a' using ave
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df$c <- with(df, ave(b, a, FUN=Mode))
df$c
#[1] 2 2 2 2 3 3 3 1 1
Or using data.table
library(data.table)
setDT(df)[, c:= Mode(b), by=a][]
Here is a base R method that uses table to calculate a cross tab, max.col to find the mode per group, and rep together with rle to fill in the mode across groups.
# calculate a cross tab, frequencies by group
myTab <- table(df$a, df$b)
# repeat the mode for each group, as calculated by colnames(myTab)[max.col(myTab)]
# repeating by the number of times the group ID is observed
df$c <- rep(colnames(myTab)[max.col(myTab)], rle(df$a)$length)
df
a b c
1 1 2 2
2 1 2 2
3 1 1 2
4 1 2 2
5 2 3 3
6 2 3 3
7 2 1 3
8 3 1 2
9 3 2 2
Note that this assumes the data has been sorted by group. Also, the default of max.col is to break ties (mulitple modes) at random. If you want the first or last value to be the mode, you can set this using the ties.method argument.

Select max or equal value from several columns in a data frame

I'm trying to select the column with the highest value for each row in a data.frame. So for instance, the data is set up as such.
> df <- data.frame(one = c(0:6), two = c(6:0))
> df
one two
1 0 6
2 1 5
3 2 4
4 3 3
5 4 2
6 5 1
7 6 0
Then I'd like to set another column based on those rows. The data frame would look like this.
> df
one two rank
1 0 6 2
2 1 5 2
3 2 4 2
4 3 3 3
5 4 2 1
6 5 1 1
7 6 0 1
I imagine there is some sort of way that I can use plyr or sapply here but it's eluding me at the moment.
There might be a more efficient solution, but
ranks <- apply(df, 1, which.max)
ranks[which(df[, 1] == df[, 2])] <- 3
edit: properly spaced!

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the follows. Actually my real ’df’ dataframe is much bigger than this one here but I really do not want to confuse anybody so that is why I try to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occured 3 times and number '3' occured 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occured 4 times, number '2' occured 3 times and number '3' occured 3 times.
So this is what I would like to do. Calculating the occurrences of numbers for each custom-defined subsets (and then collecting these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab[3,'a',1]
[1] 4
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[id==x,-1],2,table))
However, when a grouping doesn't have all the elements in it, as in 1a, the result will be a list for that id group rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))

Resources