Drop last 5 columns from a dataframe without knowing specific number - r

I have a dataframe that is created by a for-loop with a changing number of columns.
In a different function I want the drop the last five columns.
The variable with the length of the dataframe is "units" and it has numbers between 10 an 150.
I have tried using the names of the columns to drop but it is not working. (As soon as I try to open "newframe" R studio crashes, viewing myframe is no problem).
drops <- c("name1","name2","name3","name4","name5")
newframe <- results[,!(names(myframe) %in% drops)]
Is there any way to just drop the last five columns of a dataframe without relying on names or numbers of the columns

length(df) can also be used:
mydf[1:(length(mydf)-5)]

You could use the counts of columns (ncol()):
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10), ws = rnorm(10))
# rm last 2 columns
df[ , -((ncol(df) - 1):ncol(df))]
# or
df[ , -seq(ncol(df)-1, ncol(df))]

Yo can take advantage of the list method for head() (which drops whole list elements, and works differently to the data.frame method which drops rows):
# data.frame with 26 columns (named a-z):
df <- setNames( as.data.frame( as.list(1:26)) , letters )
# drop last 5 'columns':
as.data.frame( head(as.list(df),-5) )
# a b c d e f g h i j k l m n o p q r s t u
#1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

My preferable method is using rev which makes the syntax cleaner. For mtcars data set
mtcars[-rev(seq_len(ncol(mtcars)))[1:5]]
Or using head (similar to Simons suggestion)
mtcars[head(seq_len(ncol(mtcars)), -5)]

A tidyverse option is to use last_col, where we first select the fifth column from the last column (i.e., last_col(offset = 4)) then to the last column number. Then, we use the - to remove the selected columns.
library(tidyverse)
df %>%
select(-(last_col(offset = 4):last_col()))
Output
x y z
1 1 10 5
2 2 9 5
3 3 8 5
4 4 7 5
5 5 6 5
6 6 5 5
7 7 4 5
8 8 3 5
9 9 2 5
10 10 1 5
Another option is to use ncol in the select:
df %>%
select(-((ncol(.) - 4):ncol(.)))
Or we could use tail with names:
df %>%
select(-tail(names(.), 5))
Data
df <- structure(list(x = 1:10, y = 10:1, z = c(5, 5, 5, 5, 5, 5, 5,
5, 5, 5), a = 11:20, b = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j"), c = c("t", "s", "r", "q", "p", "o", "n", "m",
"l", "k"), d = 30:39, e = 50:59), class = "data.frame", row.names = c(NA,
-10L))

If you are using data.table package for your data processing, one nice way can be
drops <- c("name1","name2","name3","name4","name5")
df[, .SD, .SDcols=!drops]
In fact, this allows you to drop any variables as you like.

Related

Subset a data frame based on count of values of column x. Want only the top two in R

here is the data frame
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)
I want to subset the data frame such that I get a new data frame with only the top two"x" based on the count of "x".
One of the simplest ways to achieve what you want to do is with the package data.table. You can read more about it here. Basically, it allows for fast and easy aggregation of your data.
Please note that I modified your initial data by appending the elements 10 and c to p and x, respectively. This way, you won't see a NA when filtering the top two observations.
The idea is to sort your dataset and then operate the function .SD which is a convenient way for subsetting/filtering/extracting observations.
Please, see the code below.
library(data.table)
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2, 10)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b", "c")
df <- data.table(p, x)
# Sort by the group x and then by p in descending order
setorder( df, x, -p )
# Extract the first two rows by group "x"
top_two <- df[ , .SD[ 1:2 ], by = x ]
top_two
#> x p
#> 1: a 45
#> 2: a 6
#> 3: b 6
#> 4: b 3
#> 5: c 54
#> 6: c 10
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
Using dplyr:
library(dplyr)
df %>%
add_count(x) %>%
slice_max(n, n = 2)
p x n
1 1 a 4
2 3 b 4
3 45 a 4
4 1 a 4
5 1 b 4
6 6 a 4
7 6 b 4
8 2 b 4

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join version for this I'm missing here, but I have two data frames, where
The merging should happen in the first data frame, hence left_join
I not only want to add columns, but also update existing columns in the first data frame, more specifically: replace NA's in the first data frame by values in the second data frame
The second data frame contains more rows than the first one.
Condition #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
a = c("A", "B", "C", "D"),
q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
left_join(., y %>% select(id, q), by = c("id")) %>%
rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
left_join(y, by = 'id') %>%
transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w

melt() is using all column names as id variables

So, with ths dummy dataset
test_species <- c("a", "b", "c", "d", "e")
test_abundance <- c(4, 7, 15, 2, 9)
df <- rbind(test_species, test_abundance)
df <- as.data.frame(df)
colnames(df) <- c("a", "b", "c", "d", "e")
df <- dplyr::slice(df, 2)
we get a dataframe that's something like this:
a b c d e
4 7 15 2 9
I'd like to transform it into something like
species abundance
a 4
b 7
c 15
d 2
e 9
using the reshape2 function melt(). I tried the code
melted_df <- melt(df,
variable.name = "species",
value.name = "abundance")
but that tells me: "Using a, b, c, d, e as id variables", and the end result looks like this:
a b c d e
4 7 15 2 9
What am I doing wrong, and how can I fix it?
You can define it in the correct shape from the start, using only base library functions:
> data.frame(species=test_species, abundance=test_abundance)
species abundance
1 a 4
2 b 7
3 c 15
4 d 2
5 e 9
Rbind is adding some odd behaviour I think, I cannot explain why.
A fairly basic fix is:
test_species <-c("a", "b", "c", "d", "e")
test_abundance <- c(4, 7, 15, 2, 9)
df <- data.frame(test_species, test_abundance) #skip rbind and go straight to df
colnames(df) <- c('species', 'abundance') #colnames correct
This skips the rbind function and will give desired outcome.

select rows by value from a data frame in a list to assign new value r

How can I select rows of a data frame in a list by value and assign a new value to a certain column?
When I run this code:
df <- data.frame(x = c(10,55,32,78,47, NA),
y = c("a", "a", "b", "b", "c", "d"))
df1 <- data.frame(x = c(7.3,5.65,3.72,7.81,4.79, NA),
y = c("a", "a", "b", "b", "c", "d"))
dat <- list("df" = df, "df1" = df1)
dat[['df']]['y' == "d", 1] <- 15
the value 15 is assigned to all the values of column x and y.
I only want the column x at y == "d" to be 15 in the data frame df. I don't want to transform the list of 2 data frames to 2 single data frames but select the row where y == "d" from the data frame df within the list. How could I do this?
We can do this with lapply to loop through the list, extract the column 'x' where 'y' is "d" and assign it to 15; Make sure to return the whole dataset afterwards
lapply(dat, function(v) {
v$y <- as.character(v$y)
v$x[v$y =="d"] <- 15
v})
If the assignment needs to be only for a particular dataset
dat$df$x[dat$df$y=="d"] <- 15
Also, if the assignment needs to be done by doing a check on the names of the list, then loop through the names and then do the assignment
for(nm in names(dat)) if(nm == "df") dat[[nm]]$x[dat[[nm]]$y=='d'] <- 15
You can try a tidyverse as well
library(tidyverse)
map(dat, ~mutate(., x = ifelse(y == "d", 15, x)))
$df
x y
1 10 a
2 55 a
3 32 b
4 78 b
5 47 c
6 15 d
$df1
x y
1 7.30 a
2 5.65 a
3 3.72 b
4 7.81 b
5 4.79 c
This will help you replace NA with 15.
dat[[1]][1][6,] <- 15

subset a dataframe based on sum of a column

I have a df that looks like this:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
5 e 0.11258865
6 f 0.07228970
7 g 0.05673759
8 h 0.05319149
9 i 0.03989362
I would like to subset it using the sum of the column value, i.e, I want to extract those rows which sum of values from column value is higher than 0.6, but starting to sum values from the first row. My desired output will be:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
I have tried df2[, colSums[,5]>=0.6] but obviously colSums is expecting an array
Thanks in advance
Here's an approach:
df2[seq(which(cumsum(df2$value) >= 0.6)[1]), ]
The result:
name value
1 a 0.2001942
2 b 0.1799645
3 c 0.1425701
4 d 0.1425701
I'm not sure I understand exactly what you are trying to do, but I think cumsum should be able to help.
First to make this reproducible, let's use dput so others can help:
df <- structure(list(name = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), value = c(0.20019421,
0.17996454, 0.1425701, 0.1425701, 0.11258865, 0.0722897, 0.05673759,
0.05319149, 0.03989362)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-9L))
Then look at what cumsum(df$value) provides:
cumsum(df$value)
# [1] 0.2001942 0.3801587 0.5227289 0.6652990 0.7778876 0.8501773 0.9069149 0.9601064 1.0000000
Finally, subset accordingly:
subset(df, cumsum(df$value) <= 0.6)
# name value
# 1 a 0.2001942
# 2 b 0.1799645
# 3 c 0.1425701
subset(df, cumsum(df$value) >= 0.6)
# name value
# 4 d 0.14257010
# 5 e 0.11258865
# 6 f 0.07228970
# 7 g 0.05673759
# 8 h 0.05319149
# 9 i 0.03989362

Resources