Subset a dataframe based on the sum of a column in R

I have a df that looks like this:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
5 e 0.11258865
6 f 0.07228970
7 g 0.05673759
8 h 0.05319149
9 i 0.03989362
I would like to subset it using the sum of the column value, i.e., I want to keep rows from the first row up to and including the row where the cumulative sum of value first exceeds 0.6. My desired output would be:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
I have tried df2[, colSums[,5]>=0.6], but obviously colSums is expecting an array.
Thanks in advance

Here's an approach:
df2[seq(which(cumsum(df2$value) >= 0.6)[1]), ]
The result:
name value
1 a 0.2001942
2 b 0.1799645
3 c 0.1425701
4 d 0.1425701
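For clarity, the one-liner can be unpacked into its intermediate steps (on the same df2):
cumsum(df2$value)                  # running total of value
which(cumsum(df2$value) >= 0.6)    # rows where the running total has reached 0.6
which(cumsum(df2$value) >= 0.6)[1] # the first such row: 4
seq(4)                             # 1 2 3 4, the rows to keep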

I'm not sure I understand exactly what you are trying to do, but I think cumsum should be able to help.
First, to make this reproducible, let's use dput so others can help:
df <- structure(list(name = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), value = c(0.20019421,
0.17996454, 0.1425701, 0.1425701, 0.11258865, 0.0722897, 0.05673759,
0.05319149, 0.03989362)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-9L))
Then look at what cumsum(df$value) provides:
cumsum(df$value)
# [1] 0.2001942 0.3801587 0.5227289 0.6652990 0.7778876 0.8501773 0.9069149 0.9601064 1.0000000
Finally, subset accordingly:
subset(df, cumsum(df$value) <= 0.6)
# name value
# 1 a 0.2001942
# 2 b 0.1799645
# 3 c 0.1425701
subset(df, cumsum(df$value) >= 0.6)
# name value
# 4 d 0.14257010
# 5 e 0.11258865
# 6 f 0.07228970
# 7 g 0.05673759
# 8 h 0.05319149
# 9 i 0.03989362
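Note that neither subset alone reproduces the desired output, which keeps the row that pushes the total past 0.6. One way to get exactly rows 1-4 (a sketch using the same cumsum idea) is to test the running total before each row:
subset(df, cumsum(value) - value < 0.6)
#   name     value
# 1    a 0.2001942
# 2    b 0.1799645
# 3    c 0.1425701
# 4    d 0.1425701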

Related

Adding new information to a table upon matching rows

I have very basic knowledge of R. I have two tabs (A and B) with rows I want to compare: some values match and some don't. I want R to find the matching elements and add the text value "E" to the pre-existing row in tab A when this is the case.
Example:
Tab A
ID Existing?
1 A
2 B
3 C
4 D
5 E
Tab B
ID
1 D
2 B
3 Y
4 A
5 W
Upon match:
Tab A
ID Existing?
1 A E
2 B E
3 C
4 D E
5 E
I have found information online on how to match tables but none on how to write new information when the match takes place.
Please explain like I'm 5... I have no programming background.
Thank you in advance!
Use match to get the positions of the elements of df1$ID in df2$ID, and ifelse to recode the values that are in both df1 and df2 with "E", and NA otherwise.
df1 <- data.frame(ID = LETTERS[1:5])
df2 <- data.frame(ID = c("D", "B", "Y", "A", "W"))
# match() returns the position of each ID in df2$ID, or NA if absent;
# ifelse() treats a position as TRUE and passes NA through
df1$Existing <- ifelse(match(df1$ID, df2$ID), "E", NA)
ID Existing
1 A E
2 B E
3 C <NA>
4 D E
5 E <NA>
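This works because match() never returns 0, so the test is always either a position or NA. If you would rather not rely on ifelse() coercing positions to TRUE, an explicit equivalent giving the same result is:
df1$Existing <- ifelse(is.na(match(df1$ID, df2$ID)), NA, "E")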
Another solution, using dplyr, is to join the two dataframes, where you have added the column Existing to the one being joined:
library(dplyr, warn.conflicts = FALSE)
df1 <- tibble(ID = LETTERS[1:5])
df2 <- tibble(ID = c("D", "B", "Y", "A", "W"))
df1 %>%
left_join(df2 %>% mutate(Existing = "E"))
#> Joining, by = "ID"
#> # A tibble: 5 x 2
#> ID Existing
#> <chr> <chr>
#> 1 A E
#> 2 B E
#> 3 C <NA>
#> 4 D E
#> 5 E <NA>
This will set all matching IDs to E and all non-matching to NA.
# data
tab1 <- structure(list(ID = c("A", "B", "C", "D", "E"), Existing = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_)), class = "data.frame", row.names = c(NA,
-5L))
tab2 <- structure(list(ID = c("D", "B", "Y", "A", "W")), class = "data.frame", row.names = c(NA,
-5L))
There are many ways to skin this cat. In base-R, you could try, e.g.,
tab1$Existing[tab1$ID %in% tab2$ID] <- 'E'
In practice, for anything more complicated than tables with 6 rows, you could try dplyr:
library(dplyr)
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E', NA))
Another useful tool -- with a slightly differing syntax -- is data.table.
library(data.table)
setDT(tab1)  # converts to data.table by reference; no reassignment needed
setDT(tab2)
tab1[, Existing := ifelse(ID %in% tab2$ID, 'E', NA)]
Note that here mutate and := play roughly the same role. Probably, if you work more with R, you will develop an affinity for one of the "dialects" above.
EDIT: To drop the rows with NA values (in dplyr), you could either do:
tab1 %>% mutate(Existing = ifelse(ID %in% tab2$ID, 'E',NA)) %>%
filter(!is.na(Existing))
Or piggy-backing on @jpiversen's solution:
df1 %>%
inner_join(df2 %>% mutate(Existing = "E"))

R sum of aggregate columns found in another column

Given this data with the first 4 columns (rowid, order, line, special), I need to create a column, numSpecial, as follows:
rowid order line special numSpecial
1 A 01 X 1
2 B 01 0
3 B 02 X 2
4 B 03 X 2
5 C 01 X 1
6 C 02 0
Here numSpecial is, for each row that is special itself (special = X), the count of special rows within the same order; otherwise it is 0.
I first tried adding a column that simply concatenates order with 'X', call it orderX, which would look like:
orderX
AX
BX
BX
BX
CX
CX
Then I summed the matches of the pasted order and special values against orderX:
df$numSpecial <- sum(paste(order, special, sep = "") %in% orderx)
But that doesn't work; it returns the overall sum for every row:
numSpecial
4
4
4
4
4
4
I then tried as.data.table, but I'm not getting the expected results using:
as.data.table(mydf)[, numSpecial := sum(paste(order, special, sep = "") %in% orderx), by = rowid]
However, that returns just a 1 or 0 for each row rather than the group sums:
numSpecial
1
0
1
1
1
0
Where am I going wrong with these? I don't think I should need to create that orderX column at all, but I can't figure out how to get this count right. It's similar to a COUNTIF in Excel, which is easy to do.
There are probably several ways, but you could just multiply the per-order count of "X" by a TRUE/FALSE flag of "X" being present in the row:
dat[, numSpecial := sum(special == "X") * (special == "X"), by=order]
dat
# rowid order line special numSpecial
#1: 1 A 1 X 1
#2: 2 B 1 0
#3: 3 B 2 X 2
#4: 4 B 3 X 2
#5: 5 C 1 X 1
#6: 6 C 2 0
You could also do it a bit differently like:
dat[, numSpecial := 0L][special == "X", numSpecial := .N, by=order]
Where dat was:
library(data.table)
dat <- structure(list(rowid = 1:6, order = c("A", "B", "B", "B", "C",
"C"), line = c(1L, 1L, 2L, 3L, 1L, 2L), special = c("X", "",
"X", "X", "X", "")), .Names = c("rowid", "order", "line", "special"
), row.names = c(NA, -6L), class = "data.frame")
setDT(dat)
You could use ave with a dummy variable (just filled with 1s):
df$numSpecial <- ifelse(df$special == "X", ave(rep(1,nrow(df)), df$order, df$special, FUN = length), 0)
df
# rowid order line special numSpecial
#1 1 A 1 X 1
#2 2 B 1 0
#3 3 B 2 X 2
#4 4 B 3 X 2
#5 5 C 1 X 1
#6 6 C 2 0
Note I read in your data without the numSpecial column.
Using the dplyr package:
library(dplyr)
df %>% group_by(order) %>%
mutate(numSpecial = ifelse(special=="X", sum(special=="X"), 0))
rowid order special numSpecial
1 1 A X 1
2 2 B 0
3 3 B X 2
4 4 B X 2
5 5 C X 1
6 6 C 0
One other option using base R only would be to use aggregate:
# Your data
df <- data.frame(rowid = 1:6, order = c("A", "B", "B", "B", "C", "C"), special = c("X", "", "X", "X", "X", ""))
# Make the counts
dat <- with(df, aggregate(x = list(answer = special),
                          by = list(order = order, special = special),
                          FUN = function(x) sum(x == "X")))
# Merge back to original dataset:
dat.fin <- merge(df,dat,by=c('order','special'))
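Note that merge() sorts the result by the merge keys, so the original row order is lost; if that matters, it can be restored from rowid:
dat.fin <- dat.fin[order(dat.fin$rowid), ]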

How to count the row frequency

What is the simplest way to collapse the first two columns of the data, so that each distinct row is counted with a new variable freq?
In other words, go from this:
var1 var2
1 a d
2 b e
3 b e
4 c f
5 c f
6 c f
To this:
var1 var2 freq
1 a d 1
2 b e 2
3 c f 3
You probably did not take a close look at the dplyr package (you tagged it :) ). The easiest way is below:
df <-data.frame(freq1 = c("a","b","b","c","c","c"),
freq2 = c("d","e","e","f","f","f"))
df %>% group_by(freq1,freq2) %>% tally()
Output
freq1 freq2 n
(fctr) (fctr) (int)
1 a d 1
2 b e 2
3 c f 3
I don't know if this is the easiest, but if the data isn't that complex you can create unique codes using paste0(collapse = "_") and then aggregate by that unique code using a simple table command:
data <- read.csv("data.csv")
x <- apply(data, 1, function(x) paste0(x, collapse = "_"))
table(x)
If for some reason you don't want to use the dplyr package's count function, an alternative is to use the contingency tables generated by the ftable function and filter out contingencies with 0 occurrences. For example:
df <- data.frame(freq1 = c("a", "b", "b", "c", "c", "c"),
freq2 = c("d", "e", "e", "f", "f", "f"))
x <- as.data.frame(ftable(df))
x <- x[x$Freq > 0, ]
This yields the output:
freq1 freq2 Freq
1 a d 1
5 b e 2
9 c f 3
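For reference, the dplyr count() shortcut mentioned above collapses the group_by()/tally() steps into a single call:
library(dplyr)
df %>% count(freq1, freq2)
#   freq1 freq2 n
# 1     a     d 1
# 2     b     e 2
# 3     c     f 3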

Assigning unique id to each instance in R

I have a dataframe df in R as below:
>df
#x
#a
#b
#a
#c
#b
I want a new dataframe that assigns unique id for each instance as follows:
>df
#x y
#a 1
#b 1
#a 2
#c 1
#b 2
Any help would be highly appreciated.
library(dplyr)
df <- data.frame(x=c("a", "b", "a", "c", "b"))
df %>% group_by(x) %>% mutate(y=1:length(x))
# Source: local data frame [5 x 2]
# Groups: x [3]
# x y
# (fctr) (int)
# 1 a 1
# 2 b 1
# 3 a 2
# 4 c 1
# 5 b 2
Using row_number in dplyr.
df %>% group_by(x) %>% mutate(y=row_number(x))
We can use ave and specify the FUN as seq_along, grouped by the 'x' column.
df$y <- with(df, ave(x, x, FUN=seq_along))
df
# x y
#1 a 1
#2 b 1
#3 a 2
#4 c 1
#5 b 2
Or use getanID from splitstackshape
library(splitstackshape)
getanID(df, 'x')[]
Or use data.table
library(data.table)
setDT(df)[, y:= seq_len(.N), by =x]
data
df <- structure(list(x = c("a", "b", "a", "c", "b")),
.Names = "x", row.names = c(NA, -5L), class = "data.frame")

Drop the last 5 columns from a dataframe without knowing the specific number

I have a dataframe that is created by a for-loop with a changing number of columns.
In a different function I want to drop the last five columns.
The variable with the length of the dataframe is "units" and it holds numbers between 10 and 150.
I have tried using the names of the columns to drop, but it is not working. (As soon as I try to open "newframe", RStudio crashes; viewing myframe is no problem.)
drops <- c("name1","name2","name3","name4","name5")
newframe <- results[,!(names(myframe) %in% drops)]
Is there any way to just drop the last five columns of a dataframe without relying on the names or numbers of the columns?
length() can also be used, since for a dataframe it returns the number of columns:
mydf[1:(length(mydf) - 5)]
You could use the column count (ncol()):
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10), ws = rnorm(10))
# rm last 2 columns
df[ , -((ncol(df) - 1):ncol(df))]
# or
df[ , -seq(ncol(df)-1, ncol(df))]
You can take advantage of the list method for head(), which drops whole list elements and so works differently from the data.frame method, which drops rows:
# data.frame with 26 columns (named a-z):
df <- setNames( as.data.frame( as.list(1:26)) , letters )
# drop last 5 'columns':
as.data.frame( head(as.list(df),-5) )
# a b c d e f g h i j k l m n o p q r s t u
#1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
My preferred method uses rev, which makes the syntax cleaner. For the mtcars data set:
mtcars[-rev(seq_len(ncol(mtcars)))[1:5]]
Or using head (similar to Simon's suggestion):
mtcars[head(seq_len(ncol(mtcars)), -5)]
A tidyverse option is to use last_col: we select from the fifth column from the end (i.e., last_col(offset = 4)) through the last column, then use - to remove that range.
library(tidyverse)
df %>%
select(-(last_col(offset = 4):last_col()))
Output
x y z
1 1 10 5
2 2 9 5
3 3 8 5
4 4 7 5
5 5 6 5
6 6 5 5
7 7 4 5
8 8 3 5
9 9 2 5
10 10 1 5
Another option is to use ncol in the select:
df %>%
select(-((ncol(.) - 4):ncol(.)))
Or we could use tail with names:
df %>%
select(-tail(names(.), 5))
Data
df <- structure(list(x = 1:10, y = 10:1, z = c(5, 5, 5, 5, 5, 5, 5,
5, 5, 5), a = 11:20, b = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j"), c = c("t", "s", "r", "q", "p", "o", "n", "m",
"l", "k"), d = 30:39, e = 50:59), class = "data.frame", row.names = c(NA,
-10L))
If you are using the data.table package for your data processing, one nice way is:
drops <- c("name1","name2","name3","name4","name5")
df[, .SD, .SDcols=!drops]
In fact, this allows you to drop any variables as you like.
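Since the question asked for dropping the last five columns without knowing their names, a positional variant of the same idea (a sketch, assuming df is already a data.table) would be:
df[, .SD, .SDcols = seq_len(ncol(df) - 5)]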
