Subset rows containing no 0 elements in a data.frame - r

I have a data.frame that looks like this:
cln1 cln2 cln3 cln4
0 1 2 0
3 9 7 12
1 0 13 0
4 98 23 11
I would like to keep only the rows that contain no 0 elements. The desired output will be:
cln1 cln2 cln3 cln4
3 9 7 12
4 98 23 11

Assuming your data.frame is called "mydf":
> mydf[!rowSums(mydf == 0), ]
cln1 cln2 cln3 cln4
2 3 9 7 12
4 4 98 23 11
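For anyone wondering why this works, here is a minimal sketch that rebuilds the example data and spells out the intermediate step. The names zero_count and result are only illustrative and do not appear in the answer above.

```r
# Rebuild the example data from the question
mydf <- read.table(text = "cln1 cln2 cln3 cln4
0 1 2 0
3 9 7 12
1 0 13 0
4 98 23 11", header = TRUE)

# mydf == 0 is a logical matrix; rowSums() counts the TRUEs (zeros) per row
zero_count <- rowSums(mydf == 0)

# Keep rows whose zero count is 0; this is an explicit spelling of !rowSums(...)
result <- mydf[zero_count == 0, ]
result
```

If the data can contain NA, rowSums(mydf == 0, na.rm = TRUE) keeps the counts from turning into NA.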

Creating new column names using dplyr across and .names

I have the following data frame:
df <- data.frame(A_TR1=sample(10:20, 8, replace = TRUE),A_TR2=seq(2, 16, by=2), A_TR3=seq(1, 16, by=2),
B_TR1=seq(1, 16, by=2),B_TR2=seq(2, 16, by=2), B_TR3=seq(1, 16, by=2))
> df
A_TR1 A_TR2 A_TR3 B_TR1 B_TR2 B_TR3
1 11 2 1 1 2 1
2 12 4 3 3 4 3
3 18 6 5 5 6 5
4 11 8 7 7 8 7
5 17 10 9 9 10 9
6 17 12 11 11 12 11
7 14 14 13 13 14 13
8 11 16 15 15 16 15
What I would like to do, is subtract B_TR1 from A_TR1, B_TR2 from A_TR2, and so on and create new columns from these, similar to below:
df$x_TR1 <- (df$A_TR1 - df$B_TR1)
df$x_TR2 <- (df$A_TR2 - df$B_TR2)
df$x_TR3 <- (df$A_TR3 - df$B_TR3)
> df
A_TR1 A_TR2 A_TR3 B_TR1 B_TR2 B_TR3 x_TR1 x_TR2 x_TR3
1 12 2 1 1 2 1 11 0 0
2 11 4 3 3 4 3 8 0 0
3 19 6 5 5 6 5 14 0 0
4 13 8 7 7 8 7 6 0 0
5 12 10 9 9 10 9 3 0 0
6 16 12 11 11 12 11 5 0 0
7 16 14 13 13 14 13 3 0 0
8 18 16 15 15 16 15 3 0 0
I would like to name these columns "x TR1", "x TR2", etc. I tried to do the following:
xdf <- df%>%mutate(across(starts_with("A_TR"), -across(starts_with("B_TR")), .names="x TR{.col}"))
However, I get an error in mutate():
attempt to select less than one element in integerOneIndex
I also don't know how to create the proper column names, in terms of getting the numbers right -- I am not even sure the glue() syntax allows for it. Any help appreciated here.
We could use .names in the first across to replace the substring 'A' with 'x' in the column names (.col) while subtracting the second set of columns.
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(starts_with("A_TR"),
.names = "{str_replace(.col, 'A', 'x')}") -
across(starts_with("B_TR")))
-output
df
A_TR1 A_TR2 A_TR3 B_TR1 B_TR2 B_TR3 x_TR1 x_TR2 x_TR3
1 10 2 1 1 2 1 9 0 0
2 10 4 3 3 4 3 7 0 0
3 16 6 5 5 6 5 11 0 0
4 12 8 7 7 8 7 5 0 0
5 20 10 9 9 10 9 11 0 0
6 19 12 11 11 12 11 8 0 0
7 17 14 13 13 14 13 4 0 0
8 14 16 15 15 16 15 -1 0 0
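If you would rather avoid across() here, the same subtraction can be sketched in base R. This is only an alternative under the assumption that every A_TR column has a B_TR partner with the same suffix; a_cols, b_cols and x_cols are illustrative names, and set.seed() is there only to make the sampled column reproducible.

```r
set.seed(1)  # only so that the sampled column is reproducible
df <- data.frame(A_TR1 = sample(10:20, 8, replace = TRUE),
                 A_TR2 = seq(2, 16, by = 2), A_TR3 = seq(1, 16, by = 2),
                 B_TR1 = seq(1, 16, by = 2), B_TR2 = seq(2, 16, by = 2),
                 B_TR3 = seq(1, 16, by = 2))

a_cols <- grep("^A_TR", names(df), value = TRUE)
b_cols <- sub("^A", "B", a_cols)   # matching B columns, in the same order
x_cols <- sub("^A", "x", a_cols)   # names for the new columns

# Two data frames of the same shape subtract element by element
df[x_cols] <- df[a_cols] - df[b_cols]
df
```

Deriving b_cols from a_cols keeps the pairs aligned even if the columns are not stored in matching order.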

Need sum of data rows and columns using R

I need to get the sum of my data from the rows and columns. I uploaded my data in csv and then removed NA to replace them with zeros. I just can’t get my data to read as integers and then sum it up.
data<-read.csv("DataSet.2.csv",header=FALSE)
mode(data)
[1] "list"
data[is.na(data)]=0
data
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 Var_1 Var_2 Var_3 Var_4 Var_5 Var_6 Var_7 Var_8 Var_9 Var_10
2 Crow 8 8 0 3 2 4 4 44 0 23
3 Mouse 2 0 5 4 2 6 36 636 2 2
4 Boar 15 113 48 36 15 66 14 0 2 23
5 Plain 8 17 164 14 91 0 6 10 6 32
6 Silver.Carp 3 1 0 6 7 0 35 35 0 432
7 Dog 1 0 27 0 0 11 0 0 7 43
8 Bingo 2 3 1 15 1 21 0 0 1 0
9 Chrysalis 1 0 2 0 47 0 0 0 7 3
10 Apple 2 0 3 0 0 0 0 0 5 4
11 Cork 3 0 1 0 461 8 2305 15 0 2
12 Ant 11 0 2 0 0 0 0 91 4 0
13 Cat.Claw 2 22 1 110 2 7 10 7 0 0
14 Aardvark 3 1 0 5 25 30 125 0 5 4
15 Carriage 0 3 3 15 0 533 0 1 7 3
16 Airplane 3 2 1 10 0 28 0 47 7 1
17 Clipper 2 1 2 5 0 507 0 0 23 2
18 Armadillo 3 2 4 11 24 0 2 10 3322 0
19 Cork 3 3 1 9 461 88 2305 15 233 3
20 Colt 3 4 1 10 4902 0 0 1 4322 111
21 Cat 3 22 2 220 3 11 10 7 2333 22
V12
1 Var_11
2 15
3 4
4 13
5 3
6 312
7 1
8 22
9 12
10 0
11 0
12 23
13 32
14 44
15 43
16 2
17 33
18 2
19 3
20 55
21 3
#When I use as.numeric I am getting an error
data2<-as.numeric(data)
Error: 'list' object cannot be coerced to type 'double'
It looks like your .csv file contains a header ('Var_1', 'Var_2', etc.) but you are specifying header=FALSE when you load the data, so those strings are being interpreted as data values. Additionally, it looks like your first column represents row names for your dataset. You can specify this via the row.names argument.
Instead, load the data using:
data <- read.csv("DataSet.2.csv", header=TRUE, row.names = 1)
Once the data is loaded, you can get the column and row sums via the functions colSums() and rowSums(), respectively. Additionally, if you are replacing the NA values with 0s just for the computation of the sums, you can skip that step by setting na.rm = TRUE within colSums() and rowSums(). This excludes the NA values from the computation of the sums. For example:
data <- read.csv("DataSet.2.csv", header=TRUE, row.names = 1)
row_sum <- rowSums(data, na.rm = TRUE)
col_sum <- colSums(data, na.rm = TRUE)
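Since the CSV file itself is not available, here is a small self-contained sketch of the same idea. The csv_text string is a made-up stand-in for DataSet.2.csv (three rows, three variables, one NA), not the real file.

```r
# Made-up stand-in for the CSV file: header row, first column holds the row names
csv_text <- ",Var_1,Var_2,Var_3
Crow,8,8,0
Mouse,2,NA,5
Boar,15,113,48"

data <- read.csv(text = csv_text, header = TRUE, row.names = 1)

# na.rm = TRUE skips the NA instead of returning NA for that row and column
row_sum <- rowSums(data, na.rm = TRUE)
col_sum <- colSums(data, na.rm = TRUE)
row_sum
col_sum
```

One caveat: row.names requires unique values, and the data printed in the question lists Cork twice. In that case the first column would have to stay an ordinary column and be left out of the sums, for example rowSums(data[-1], na.rm = TRUE).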

Group rows and add sum column of unique values

Here an example of my data.frame:
df = read.table(text='colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
22 14 3
22 19 3
22 90 3
11 19 2
11 45 2
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0', header = TRUE)
I need to group the rows by colA and colC and add a new column which states the sum of unique values based on colB.
In steps here what I need to do for this specific data.frame:
group rows with colA = 10 and 9, colA = 2 and 1, colA = 22 and colA = 11;
find the unique values of colB per each group;
add the unique values in a new col (newcolD).
Note that colC states the total number of observations for colA = 10 and 9, colA = 2 and 1, colA = 22 and colA = 11.
The data.frame needs to remain ordered decreasingly by colC.
My expected output is:
colA colB colC newcolD
10 11 7 5
10 34 7 5
10 89 7 5
10 21 7 5
9 8 0 5
9 11 0 5
9 21 0 5
2 23 5 4
2 21 5 4
2 56 5 4
1 45 0 4
1 23 0 4
22 14 3 3
22 19 3 3
22 90 3 3
11 19 2 2
11 45 2 2
Note that in df the duplicated colB values are 11 and 21 for group 10 and 9, and 23 for group 2 and 1.
You can do that with dplyr. The trick is to create a new grouping column which groups consecutive values in colA. This is done with cumsum(c(1, diff(colA) < -1)) in the example below.
df1 = read.table(text='colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
22 14 3
22 19 3
22 90 3
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0', header = TRUE,stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
arrange(desc(colA)) %>%
group_by(group_sequential = cumsum(c(1, diff(colA) < -1))) %>%
mutate(newcolD=n_distinct(colB))
colA colB colC group_sequential newcolD
<int> <int> <int> <dbl> <int>
1 22 14 3 1 3
2 22 19 3 1 3
3 22 90 3 1 3
4 10 11 7 2 5
5 10 34 7 2 5
6 10 89 7 2 5
7 10 21 7 2 5
8 9 8 0 2 5
9 9 11 0 2 5
10 9 21 0 2 5
11 2 23 5 3 4
12 2 21 5 3 4
13 2 56 5 3 4
14 1 45 0 3 4
15 1 23 0 3 4
EDIT FOR NEW DATA
With the data you added, we need to create a custom grouping. I use case_when in the example below. This matches the order you show in the desired output column. In the text, you wrote that you wanted the table to be sorted by colC. To do so, change the last line to arrange(desc(colC)).
df1 = read.table(text='colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
22 14 3
22 19 3
22 90 3
11 19 2
11 45 2
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0', header = TRUE,stringsAsFactors=FALSE)
library(dplyr)
df1 %>%
group_by(group_sequential = case_when(.$colA==10|.$colA==9~1,
.$colA==2|.$colA==1~2,
.$colA==22~3,
.$colA==11~4)) %>%
mutate(newcolD=n_distinct(colB)) %>%
arrange(desc(newcolD))
colA colB colC group_sequential newcolD
<int> <int> <int> <dbl> <int>
1 10 11 7 1 5
2 10 34 7 1 5
3 10 89 7 1 5
4 10 21 7 1 5
5 9 8 0 1 5
6 9 11 0 1 5
7 9 21 0 1 5
8 2 23 5 2 4
9 2 21 5 2 4
10 2 56 5 2 4
11 1 45 0 2 4
12 1 23 0 2 4
13 22 14 3 3 3
14 22 19 3 3 3
15 22 90 3 3 3
16 11 19 2 4 2
17 11 45 2 4 2
You're really not making it easy for us, reposting slight variations of the same question instead of updating the old one and presenting conditions that are vague and inconsistent with what the desired output implies. Anyhow, here is my attempt. This is more an answer to the second question you posted, as that was a bit more general in form.
It's a bit messy; it's pretty much a direct translation of your conditions into a for loop with some if statements. I chose to focus on your written conditions rather than the expected output, as that was the easier one to understand. If you want a better answer, please consider cleaning up your question(s) considerably.
df1 <- read.table(text="
colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
22 14 3
22 19 3
22 90 3
11 19 2
11 45 2
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0", header=TRUE)
df2 <- read.table(text="
colA colB colC
10 11 7
10 34 7
10 89 7
10 21 7
2 23 5
2 21 5
2 56 5
33 24 3
33 78 3
22 14 3
22 19 3
22 90 3
11 19 2
11 45 2
1 45 0
1 23 0
9 8 0
9 11 0
9 21 0
32 11 0", header=TRUE)
df <- df1
for (i in 1:nrow(df)) {
df$colD[i] <- ifelse(df$colC[i] == 0,
0,
length(unique(df$colA[1:i])))
if (any(df$colA[i]-1 == df$colA[1:i]) & df$colC[i] != 0) {
df$colD[i] <- df$colD[which(df$colA[i]-1 == df$colA[1:i])][1]
}
}
# colA colB colC colD
# 10 11 7 1
# 10 34 7 1
# 10 89 7 1
# 10 21 7 1
# 2 23 5 2
# 2 21 5 2
# 2 56 5 2
# 22 14 3 3
# 22 19 3 3
# 22 90 3 3
# 11 19 2 1
# 11 45 2 1
# 1 45 0 0
# 1 23 0 0
# 9 8 0 0
# 9 11 0 0
# 9 21 0 0
df <- df2
for (i in 1:nrow(df)) {
df$colD[i] <- ifelse(df$colC[i] == 0,
0,
length(unique(df$colA[1:i])))
if (any(df$colA[i]-1 == df$colA[1:i]) & df$colC[i] != 0) {
df$colD[i] <- df$colD[which(df$colA[i]-1 == df$colA[1:i])][1]
}
}
df
# colA colB colC colD
# 10 11 7 1
# 10 34 7 1
# 10 89 7 1
# 10 21 7 1
# 2 23 5 2
# 2 21 5 2
# 2 56 5 2
# 33 24 3 3
# 33 78 3 3
# 22 14 3 4
# 22 19 3 4
# 22 90 3 4
# 11 19 2 1
# 11 45 2 1
# 1 45 0 0
# 1 23 0 0
# 9 8 0 0
# 9 11 0 0
# 9 21 0 0
# 32 11 0 0
To also group the rows where colC is zero, it's sufficient to adjust the conditionals like this:
for (i in 1:nrow(df)) {
df$colD[i] <- length(unique(df$colA[1:i]))
if (any(df$colA[i]-1 == df$colA[1:i])) {
df$colD[i] <- df$colD[which(df$colA[i]-1 == df$colA[1:i])][1]
}
}

Flag first by-group in R data frame

I have a data frame which looks like this:
id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21
I'd like to identify a way to flag the first occurrence of id -- similar to first. and last. in SAS. I've tried the !duplicated function, but I need to actually append the "flag" column to my data frame since I'm running it through a loop later on. I'd like to get something like this:
id score first_ind
1 15 1
1 18 0
1 16 0
2 10 1
2 9 0
3 8 1
3 47 0
3 21 0
> df$first_ind <- as.numeric(!duplicated(df$id))
> df
id score first_ind
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
You can find the edges using diff.
x <- read.table(text = "id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21", header = TRUE)
x$first_id <- c(1, diff(x$id))
x
id score first_id
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
Using plyr:
library("plyr")
ddply(x,"id",transform,first=as.numeric(seq(length(score))==1))
or if you prefer dplyr:
library("dplyr")
x %>% group_by(id) %>%
mutate(first=c(1,rep(0,n()-1)))
(although if you're operating completely in the plyr/dplyr framework you probably wouldn't need this flag variable anyway ...)
Another base R option:
df$first_ind <- ave(df$id, df$id, FUN = seq_along) == 1
df
# id score first_ind
#1 1 15 TRUE
#2 1 18 FALSE
#3 1 16 FALSE
#4 2 10 TRUE
#5 2 9 FALSE
#6 3 8 TRUE
#7 3 47 FALSE
#8 3 21 FALSE
This also works in case of unsorted ids. If you want 1/0 instead of T/F you can easily wrap it in as.integer(.).
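A quick check of the unsorted case, as a sketch with made-up ids; flag_ave and flag_dup are illustrative names.

```r
id <- c(2, 1, 2, 1, 3)   # made-up ids, deliberately unsorted

# Position of each row within its id group, then flag position 1
flag_ave <- as.integer(ave(id, id, FUN = seq_along) == 1)

# The !duplicated() approach from the first answer, for comparison
flag_dup <- as.integer(!duplicated(id))

flag_ave
flag_dup
```

Both approaches mark the first row of every id regardless of row order.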

Insert new columns based on the union of colnames of two data frames

I want to write an R function to insert many 0 vectors into an existing data.frame. Here is the example:
Data.frame 1
A B C D
1 1 3 4 5
2 4 5 6 7
3 4 5 6 2
4 4 55 2 3
Data.frame 2
A B E X
11 5 1 5 5
22 44 55 9 6
33 12 4 2 4
44 9 7 4 2
Based on the union of two colnames (that is A,B,C,D,E, X), I want to update the two data frames like:
Data.frame 1 (new)
A B C D E X
1 1 3 4 5 0 0
2 4 5 6 7 0 0
3 4 5 6 2 0 0
4 4 55 2 3 0 0
Data.frame 2 (new)
A B C D E X
11 5 1 0 0 5 5
22 44 55 0 0 9 6
33 12 4 0 0 2 4
44 9 7 0 0 4 2
Thanks in advance.
Option 1 (Thanks @Jilber for the edits)
I'm assuming the order of columns doesn't matter -
df2part <- subset(df2,select = setdiff(colnames(df2),colnames(df1)))*0
df1f <- cbind(df1,df2part)
df1part <- subset(df1,select = setdiff(colnames(df1),colnames(df2)))*0
df2f <- cbind(df2,df1part)
If the order really matters, then just reorder the columns
df2f <- df2f[, sort(names(df2f))]
Output
> df1f
A B C D E X
1 1 3 4 5 0 0
2 4 5 6 7 0 0
3 4 5 6 2 0 0
4 4 55 2 3 0 0
> df2f
A B C D E X
11 5 1 0 0 5 5
22 44 55 0 0 9 6
33 12 4 0 0 2 4
44 9 7 0 0 4 2
Option 2 -
library(data.table)
df1 <- data.table(df1)
df2 <- data.table(df2)
df1names <- colnames(df1)
df2names <- colnames(df2)
df1[,setdiff(df2names,df1names) := 0]
df2[,setdiff(df1names,df2names) := 0]
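Since the question asks for a function, here is one possible base R sketch. add_missing_cols is a hypothetical helper name, not part of any package, and the two data frames are rebuilt from the question.

```r
# Hypothetical helper: add every missing column as zeros, then order the columns
add_missing_cols <- function(dat, all_cols) {
  for (nm in setdiff(all_cols, names(dat))) {
    dat[[nm]] <- rep(0, nrow(dat))
  }
  dat[all_cols]
}

df1 <- data.frame(A = c(1, 4, 4, 4), B = c(3, 5, 5, 55),
                  C = c(4, 6, 6, 2), D = c(5, 7, 2, 3))
df2 <- data.frame(A = c(5, 44, 12, 9), B = c(1, 55, 4, 7),
                  E = c(5, 9, 2, 4), X = c(5, 6, 4, 2),
                  row.names = c("11", "22", "33", "44"))

# Union of the column names, sorted so both results share one column order
all_cols <- sort(union(names(df1), names(df2)))

df1_new <- add_missing_cols(df1, all_cols)
df2_new <- add_missing_cols(df2, all_cols)
df1_new
df2_new
```

Passing the full set of column names in means the same helper works for any number of data frames, for example with lapply() over a list.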
