Using apply() and an if() statement to sum() two columns - r

I have a dataframe with 2 columns, and I want to use an if/else condition with the apply function to sum() rows in each column: specifically, for all the rows where Col1 >= Col2, take the sum() of Col1 and store it in variable a; for all the rows where Col1 < Col2, take the sum() of Col1 and store it in variable b.
For example
df<-data.frame(Col1=c(1,2,3,4,5),Col2=c(5,4,3,2,1))
df
  Col1 Col2
1    1    5
2    2    4
3    3    3
4    4    2
5    5    1
There are three instances in which Col1 >= Col2, so in Col1 I take the sum() of 3+4+5, which is 12. There are two instances in which Col1 < Col2, so in Col1 I take the sum() of 1+2, which is 3. So
> a
[1] 12
> b
[1] 3
This is the code I created, but it's still in the works:
apply(df, 1, function(x)
  if (df$Col1 >= df$Col2)
    a <- sum(df$Col1 >= df$Col2)
  else
    b <- sum(df$Col1 < df$Col2)
)
The code here doesn't work because it simply adds the number of times the condition is true and not the actual values.

There's really no need for any *apply() functions here, as these are fully vectorized operations. Here's how I might go about it, putting both results into a nice list.
with(df, {
  x <- Col1 >= Col2
  list(a = sum(Col1[x]), b = sum(Col1[!x]))
})
# $a
# [1] 12
#
# $b
# [1] 3

I'm not sure why you would want to tackle this problem using apply(); it seems like overkill. Also note that apply() needs its MARGIN argument to indicate whether the function is applied to rows (1) or columns (2).
A simple two line solution would be this:
df <- data.frame(Col1=c(1,2,3,4,5), Col2=c(5,4,3,2,1))
a <- sum(df$Col1[df$Col1 >= df$Col2])
b <- sum(df$Col1[df$Col1 < df$Col2])
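With the example data this reproduces the expected results:
a
# [1] 12
b
# [1] 3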

Related

Remove data.table rows whose vector elements contain nested NAs

I need to remove from a data.table any row in which column a contains any NA nested in a vector:
library(data.table)
a <- list(as.numeric(c(NA, NA)), 2, as.numeric(c(3, NA)), c(4, 5))
b <- 11:14
dt <- data.table(a,b)
Thus, rows 1 and 3 should be removed.
I tried three solutions without success:
dt1 <- dt[!is.na(a)]
dt2 <- dt[!is.na(unlist(a))]
dt3 <- dt[dt[,!Reduce(`&`, lapply(a, is.na))]]
Any ideas? Thank you.
You can do the following:
dt[sapply(dt$a, \(l) !any(is.na(l)))]
This alternative also works, but you will get warnings, because all() has to coerce the numeric elements to logical:
dt[sapply(dt$a, all)]
A better approach (thanks to r2evans, see comments):
dt[!sapply(a, anyNA)]
Output:
a b
1: 2 12
2: 4,5 14
A third option that you might prefer: move the functionality into a separate helper function that takes a list of vectors (nl) and returns a logical vector of length equal to length(nl), then apply that function as below. In this example, I explicitly call unlist() on the result of lapply() rather than letting sapply() do that for me, but sapply() would work as well.
f <- \(nl) unlist(lapply(nl, \(l) !any(is.na(l))))
dt[f(a)]
An alternative to the *apply() functions:
dt[, .SD[!anyNA(a, recursive = TRUE)], by = .I][, !"I"]
# a b
# <list> <int>
# 1: 2 12
# 2: 4,5 14
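For completeness, here is why the original attempts returned every row: is.na() applied to a list column tests each list element as a whole, and is only TRUE for an element that is itself a single NA, never for a longer vector that merely contains NAs. A quick illustration:
is.na(dt$a)
# [1] FALSE FALSE FALSE FALSE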

R - number of unique values in a column of data frame

For a dataframe df, I need to find the number of unique values in some_col. I tried the following
length(unique(df["some_col"]))
but this is not giving the expected results. However length(unique(some_vector)) works on a vector and gives the expected results.
Here are some preceding steps from when the df is created:
df <- read.csv(file, header=T)
typeof(df) #=> "list"
typeof(unique(df["some_col"])) #=> "list"
length(unique(df["some_col"])) #=> 1
Try with [[ instead of [. [ returns a list (a data.frame in fact), [[ returns a vector.
df <- data.frame(some_col = c(1,2,3,4),
                 another_col = c(4,5,6,7))
length(unique(df[["some_col"]]))
# [1] 4
class(df[["some_col"]])
# [1] "numeric"
class(df["some_col"])
# [1] "data.frame"
You're getting a value of 1 because the list is of length 1 (1 column), even though that 1 element contains several values.
You need to use
length(unique(unlist(df[c("some_col")])))
When you select a column with df[c("some_col")] or df["some_col"], it is pulled as a list (a one-column data.frame). unlist() converts it into a vector that you can work with easily. When you select the column with df$some_col, it is pulled as a vector in the first place.
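A quick sketch of the three access patterns on hypothetical toy data:
df <- data.frame(some_col = c(1, 2, 2, 3))
class(df["some_col"])    # "data.frame" - a list of length 1
class(df[["some_col"]])  # "numeric" - a plain vector
length(unique(unlist(df["some_col"])))
# [1] 3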
I think you might just be missing a ,
Try
length(unique(df[,"some_col"]))
In response to a comment:
df <- data.frame(cbind(A=c(1:10),B=rep(c("A","B"),5)))
df["B"]
Output :
B
1 A
2 B
3 A
4 B
5 A
6 B
7 A
8 B
9 A
10 B
and
length(unique(df["B"]))
Output:
[1] 1
Which is the same incorrect/undesirable output as the OP posted
HOWEVER, with a comma:
df[,"B"]
Output :
[1] A B A B A B A B A B
Levels: A B
and
length(unique(df[,"B"]))
Now gives you the correct/desired output by the OP. Which in this example is 2
[1] 2
The reason is that df["some_col"] returns a data.frame, and calling length() on a data.frame counts its columns, which is 1 here, while df[,"some_col"] returns a vector, and length() on a vector correctly returns the number of elements. So you see, a comma (,) makes all the difference.
using tidyverse
df %>%
  select("some_col") %>%
  n_distinct()
The data.table package contains the convenient shorthand uniqueN. From the documentation
uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows are computed directly without materialising the intermediate unique data.table and is therefore faster and memory efficient.
You can use it with a data frame:
df <- data.frame(some_col = c(1,2,3,4),
                 another_col = c(4,5,6,7))
data.table::uniqueN(df[['some_col']])
[1] 4
or if you already have a data.table
dt <- setDT(df)
dt[,uniqueN(some_col)]
[1] 4
Here is another option:
df %>%
  distinct(column_name) %>%
  count()
or this without tidyverse:
count(distinct(df, column_name))
Checking benchmarks on the web, you will see that distinct() is fast.

R - Select Rows Where Number of Values Satisfies Condition

I have a dataframe called df, what I want to do is select all rows where there are at least n values in that row satisfying some condition c.
For example, I want rows from df such that at least 50% of the values (or columns) in the row are greater than 0.75.
Here is what I came up with to accomplish this:
test <- df[apply(df, 1, function(x) (length(x[x > 0.75]) / length(x) > 0.5))]
Unfortunately I am getting this error message:
Error in `[.data.frame`(df, apply(df, :
undefined columns selected
I am very new to R, so I'm pretty stuck at this point, what's the problem here?
You are getting that error message because you haven't told R what columns you want to include in your subset.
You have:
df[your_apply_function]
Which doesn't specify which columns. Instead, you should try
df[your_apply_function, ]
That means 'subset df for all rows where the apply function returns TRUE, and keep all columns'.
However, I would approach it by using dplyr:
library(dplyr)
# count, for each row, how many values exceed 0.75
rowcounts <- apply(df, 1, function(x) sum(x > 0.75))
df <- mutate(df, rowcounts = rowcounts)
# compare against half the original columns (excluding the new rowcounts column)
df <- filter(df, rowcounts > (ncol(df) - 1) / 2)
I didn't get to test this yet (code still running on my machine), but it looks right to my eye. When I get a chance I will test it.
This can be accomplished with a cellwise comparison against 0.75, rowSums(), and then a vectorized comparison against 0.5:
set.seed(3L); NR <- 5L; NC <- 4L; df <- as.data.frame(matrix(rnorm(NR*NC,0.75,0.1),NR));
df;
## V1 V2 V3 V4
## 1 0.6538067 0.7530124 0.6755218 0.7192344
## 2 0.7207474 0.7585418 0.6368781 0.6546983
## 3 0.7758788 0.8616610 0.6783642 0.6851757
## 4 0.6347868 0.6281143 0.7752652 0.8724314
## 5 0.7695783 0.8767369 0.7652046 0.7699812
df[rowSums(df>0.75)/ncol(df)>=0.5,];
## V1 V2 V3 V4
## 3 0.7758788 0.8616610 0.6783642 0.6851757
## 4 0.6347868 0.6281143 0.7752652 0.8724314
## 5 0.7695783 0.8767369 0.7652046 0.7699812
This can work on both matrices and data.frames.
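Equivalently, rowMeans() on the logical matrix gives the fraction of values per row that meet the condition, which reads a little more directly (a small variation on the same idea):
# keep rows where at least half of the values exceed 0.75
df[rowMeans(df > 0.75) >= 0.5, ]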

R Using a for() loop to fill one dataframe with another

I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of 2 columns. Column 1 (col1) contains the characters a to z, and col2 has a value associated with each character (from a to z).
DF2 is a dataframe with 3 columns. The first two contain every combination of DF1$col1, so: aa, ab, ac, ad, etc., where the first letter is in col1 and the second letter is in col2.
I want to create a simple mathematical model utilizing the values in DF1$col2 to see the outcomes of every possible combination of objects in DF1$col1
The first step I wanted to do is to transfer values from DF1$col2 to DF2$col3 (the values in DF2$col3 should be associated with the characters in DF2$col1), but that's where I'm stuck. I currently have
for (j in 1:length(DF2$col1)) {
  ## this part is to use the characters in DF2$col1 as an input
  ## to yield the output for DF2$col3
  input <- c(DF2$col1)[j]
  ## this is supposed to use the values found in DF1$col2 to fill in DF2$col3
  g <- DF1[(DF1$col2 == input), "pred"]
  ## this is so that the values will fill in DF2$col3
  DF2$col3 <- g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time
You should really use merge for this, as @Aaron suggested in his comment above, but if you insist on writing your own loop, then the problem is in your last line, where you assign the value of g to the whole col3 column. You should use the j index there too, like:
for (j in 1:length(DF2$col1)) {
  DF2$col3[j] <- DF1[which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this does not work out, then please also post some sample data so we can help in more detail (I do not know, but can only guess, what "pred" might be).
It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.
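For reference, the merge()-based join suggested in the comments would look something like this (a sketch using the same column names as above; note that merge() may reorder rows, unlike match()):
DF1 <- data.frame(col1 = letters, col2 = 1:26, stringsAsFactors = FALSE)
DF2 <- expand.grid(col1 = letters, col2 = letters, stringsAsFactors = FALSE)
# join on the letter key; DF1's value column comes back suffixed as col2.y
DF2 <- merge(DF2, DF1, by = "col1", suffixes = c("", ".y"))
names(DF2)[names(DF2) == "col2.y"] <- "col3"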
Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command uses recycling (it could be written as rep(df1$col2, 26) as well). This lines up correctly because expand.grid() varies its first argument fastest, so col1 cycles through all the letters within each value of col2.
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17

How to select rows from data.frame with 2 conditions

I have an aggregated table:
> aggdata[1:4,]
  Group.1 Group.2         x
1       4    0.05 0.9214660
2       6    0.05 0.9315789
3       8    0.05 0.9526316
4      10    0.05 0.9684211
How can I select the x value when I have values for Group.1 and Group.2?
I tried:
aggdata[aggdata[,"Group.1"]==l && aggdata[,"Group.2"]==lambda,"x"]
but that returns all the x's.
More info:
I want to use this like this:
table <- data.frame()
for (l in unique(aggdata[,"Group.1"])) {
  for (lambda in unique(aggdata[,"Group.2"])) {
    table[l, lambda] <- aggdata[aggdata[,"Group.1"]==l & aggdata[,"Group.2"]==lambda, "x"]
  }
}
Any suggestions that are even easier and give this result are appreciated!
The easiest solution is to change "&&" to "&" in your code.
> aggdata[aggdata[,"Group.1"]==6 & aggdata[,"Group.2"]==0.05,"x"]
[1] 0.9315789
My preferred solution would be to use subset():
> subset(aggdata, Group.1==6 & Group.2==0.05)$x
[1] 0.9315789
Use & not &&. The latter only evaluates the first element of each vector.
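A quick illustration of the difference (note that in recent R versions, && with vectors of length greater than one is an error rather than a silent comparison of the first elements):
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
# [1]  TRUE FALSE FALSE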
Update: to answer the second part, use the reshape package. Something like this will do it:
tablex <- recast(aggdata, Group.1 ~ variable * Group.2, id.var=1:2)
# Now add useful column and row names
colnames(tablex) <- gsub("x_","",colnames(tablex))
rownames(tablex) <- tablex[,1]
# Finally remove the redundant first column
tablex <- tablex[,-1]
Someone with more experience using reshape may have a simpler solution.
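For what it's worth, the same table can also be built with the newer reshape2 package (a sketch, assuming aggdata has the three columns shown above):
library(reshape2)
# one row per Group.1, one column per Group.2, cells filled with x
tablex <- dcast(aggdata, Group.1 ~ Group.2, value.var = "x")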
Note: Don't use table as a variable name as it conflicts with the table() function.
There is a really helpful document on subsetting R data frames at:
http://www.ats.ucla.edu/stat/r/modules/subsetting.htm
Here is the relevant excerpt:
Subsetting rows using multiple conditional statements: There is no limit to how many logical statements may be combined to achieve the subsetting that is desired. The data frame x.sub1 contains only the observations for which the values of the variable y is greater than 2 and for which the variable V1 is greater than 0.6.
x.sub1 <- subset(x.df, y > 2 & V1 > 0.6)
