I have two data frames with different lengths. One is a sample and the other is a test sample.
df1 a b c d ...
1 0 0 0
2 0 0 1
df2 a e b c d ...
1 1 0 0 0
2 0 0 0 1
How can I delete the columns of df2 that are not in common with df1?
As a result I'm looking for df2 with the same columns as df1 (a, b, c, d ...).
I tried merge() but it's not what I'm looking for.
If I understand your question correctly, you can subset by column name like this:
df2[, colnames(df1)]
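If df1 has column names that are not present in df2, the line above fails with an "undefined columns selected" error. In that case, subset by the intersection of the names instead: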
df2[, intersect(colnames(df1), colnames(df2))]
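A minimal sketch of the subsetting above, on data shaped like the question's example. The values in df1 and df2 are reconstructed from the printout and are only illustrative, and df2_common is a name introduced here. drop = FALSE is added so the result stays a data frame even if only one column is shared.

```r
# Example data reconstructed from the question (values are illustrative)
df1 <- data.frame(a = 1:2, b = c(0, 0), c = c(0, 0), d = c(0, 1))
df2 <- data.frame(a = 1:2, e = c(1, 0), b = c(0, 0), c = c(0, 0), d = c(0, 1))

# Column names present in both data frames, in df1's order
common <- intersect(colnames(df1), colnames(df2))

# Keep only those columns of df2; drop = FALSE keeps a data frame
df2_common <- df2[, common, drop = FALSE]
names(df2_common)
```

Column e is dropped because it does not appear in df1.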
I have 2 dfs (simplified example):
df1 a b c g ...
1 0 0 0
2 0 0 1
And
df2 a b d e f ...
1 1 0 0 0
2 0 0 0 1
I would like to merge the 2 dfs but before joining I would like to remove common columns in df1 and df2. So I would retain columns (c,d,e,f,g) as a and b are common in df1 and df2.
So basically doing the opposite of what was answered here:
delete columns in data frame not in common with another (R)
Using the set operations union, intersect, and setdiff on the names of both data frames, we can do this:
df1 <- read.table(header = TRUE, text = 'a b c g
1 0 0 0
2 0 0 1')
df2 <- read.table(header = TRUE, text = 'a b d e f
1 1 0 0 0
2 0 0 0 1')
# uncommon column names
x <- setdiff(union(names(df1), names(df2)), intersect(names(df1), names(df2)))
cbind(df1[names(df1) %in% x], df2[names(df2) %in% x])
#> c g d e f
#> 1 0 0 0 0 0
#> 2 0 1 0 0 1
Created on 2021-06-15 by the reprex package (v2.0.0)
In base R, you can start by using the duplicated function to work out which column names both data frames have in common. From there, it's just a matter of selecting and binding the columns from each data frame that are not on this list.
dupes <- c(names(df1), names(df2))[duplicated(c(names(df1), names(df2)))]
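# logical indexing is used instead of -which(), which would select zero columns when no names are shared
df3 <- cbind(df1[, !(names(df1) %in% dupes), drop = FALSE], df2[, !(names(df2) %in% dupes), drop = FALSE])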
Following your example, this would produce the following data frame, consisting only of the unique columns from each of the others. This is based on the assumption that both data frames have the same number of rows.
df3 c g d e f ...
0 0 0 0 0
0 1 0 0 1
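The two answers above can be folded into one small function. This is only a sketch: bind_unshared is a hypothetical name, and it assumes, as noted above, that both data frames have the same number of rows.

```r
# Hypothetical helper: bind the columns of x and y whose names are not shared
bind_unshared <- function(x, y) {
  stopifnot(nrow(x) == nrow(y))  # cbind needs matching row counts
  shared <- intersect(names(x), names(y))
  cbind(
    x[, !(names(x) %in% shared), drop = FALSE],
    y[, !(names(y) %in% shared), drop = FALSE]
  )
}

df1 <- read.table(header = TRUE, text = "a b c g
1 0 0 0
2 0 0 1")
df2 <- read.table(header = TRUE, text = "a b d e f
1 1 0 0 0
2 0 0 0 1")

df3 <- bind_unshared(df1, df2)
names(df3)
```

Logical indexing with drop = FALSE keeps the result a data frame when only one column survives, and keeps every column when no names are shared.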
I'm trying to create a new dataset from an existing one. The new dataset is supposed to combine 60 rows from the original dataset in order to convert a sum of events occurring each second to the total by minute. The number of columns will generally not be known in advance.
For example, with this dataset, if we split it into groups of 3 rows:
d1
a b c d
1 1 1 0 1
2 0 1 0 1
3 0 1 0 0
4 0 0 1 0
5 0 0 1 0
6 1 0 0 0
We'll get this data.frame. Row 1 contains the column sums for rows 1-3 of d1 and Row 2 contains the column sums for rows 4-6 of d1:
d2
a b c d
1 1 3 0 2
2 1 0 2 0
I've tried d2<-colSums(d1[seq(1,NROW(d1),3),]) which is about as close as I've been able to get.
I've also considered recommendations from How to sum rows based on multiple conditions - R?, How to select every xth row from table, Remove last N rows in data frame with the arbitrary number of rows, sum two columns in R, and Merging multiple rows into single row. I'm all out of ideas. Any help would be greatly appreciated.
Create a grouping variable, group_by that variable, then summarise_all.
# your data
d <- data.frame(a = c(1,0,0,0,0,1),
b = c(1,1,1,0,0,0),
c = c(0,0,0,1,1,0),
d = c(1,1,0,0,0,0))
# create the grouping variable
d$group <- rep(c("A","B"), each = 3)
# apply sum() to all columns within each group
library(dplyr)
d %>%
group_by(group) %>%
summarise_all(sum)
Returns:
# A tibble: 2 x 5
group a b c d
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 3 0 2
2 B 1 0 2 0
Overview
After reading Split up a dataframe by number of rows, I realized the only thing you need to know is how you'd like to split() d1.
Here, you'd like to split d1 into multiple data frames of 3 rows each. To do that, use rep() to repeat each element of the sequence 1:2 three times (the number of rows divided by the length of the sequence).
After that, the logic involves using map() to sum each column for each data frame created after d1 %>% split(). Here, summarize_all() is helpful since you don't need to know the column names ahead of time.
Once the calculations are complete, you use bind_rows() to stack all the observations back into one data frame.
# load necessary package ----
library(tidyverse)
# load necessary data ----
df1 <-
read.table(text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0", header = TRUE)
# perform operations --------
df2 <-
df1 %>%
# split df1 into two data frames
# based on three consecutive rows
split(f = rep(1:2, each = nrow(.) / length(1:2))) %>%
# for each data frame, apply the sum() function to all the columns
map(.f = ~ .x %>% summarize_all(.funs = sum)) %>%
# collapse data frames together
bind_rows()
# view results -----
df2
# a b c d
# 1 1 3 0 2
# 2 1 0 2 0
# end of script #
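For comparison, here is a base R sketch of the same idea that needs no packages. It assumes all columns are numeric; n and block are names introduced here for illustration. rowsum() sums every column within each group, so the column names do not need to be known in advance.

```r
d1 <- read.table(header = TRUE, text = "a b c d
1 1 0 1
0 1 0 1
0 1 0 0
0 0 1 0
0 0 1 0
1 0 0 0")

n <- 3  # rows per block; use 60 to go from seconds to minutes

# Block index for every row: 1,1,1,2,2,2 (a shorter final block is allowed)
block <- ceiling(seq_len(nrow(d1)) / n)

# Sum every column within each block
d2 <- rowsum(d1, group = block)
d2
```

Because the block index is computed from the row number, this also works when the number of rows is not an exact multiple of n.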
I have a list of data frames. Each has an ID column, followed by a number of numeric columns (with column names).
I would like to replace all the 1's with 0's for all the numeric columns, but keep the ID column the same. I can do this in part with a single data frame using
df[,-1] <- 0
But when I try to embed this in lapply, it fails:
df2 <- lapply(df, function(x) {x[,-1] <- 0})
I've tried using subset, ifelse, while, and mutate, but I'm struggling with this simple replacement. I could recreate the data frames from scratch, or recombine the ID column at the end, but this strikes me as something that should be easy...
Test list:
test_list <- list(data.frame("ID"=letters[1:3], "col2"=1:3, "col3"=0:2), data.frame("ID"=letters[4:6], "col2"=4:6, "col3"=0:2))
The end result should be:
final_list <- list(data.frame("ID"=letters[1:3], "col2"=0, "col3"=0), data.frame("ID"=letters[4:6], "col2"=0, "col3"=0))
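Add return(x) to your function and it should work fine. Without it, the function returns the value of its last expression, which is the assigned value 0 rather than the modified data frame.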
lapply(test_list, function(x){
x[, -1] <- 0
return(x)
})
# [[1]]
# ID col2 col3
# 1 a 0 0
# 2 b 0 0
# 3 c 0 0
#
# [[2]]
# ID col2 col3
# 1 d 0 0
# 2 e 0 0
# 3 f 0 0
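A short sketch of why the original attempt fails, using the test list from the question. The names bad and good are introduced here only for illustration.

```r
test_list <- list(
  data.frame(ID = letters[1:3], col2 = 1:3, col3 = 0:2),
  data.frame(ID = letters[4:6], col2 = 4:6, col3 = 0:2)
)

# The assignment is the last expression, so each element is just 0
bad <- lapply(test_list, function(x) { x[, -1] <- 0 })

# Ending with x returns the modified data frame (same effect as return(x))
good <- lapply(test_list, function(x) { x[, -1] <- 0; x })

str(bad[[1]])
str(good[[1]])
```

An R function returns the value of the last expression it evaluates, so ending the body with x is equivalent to calling return(x).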
Your question is worded a little bit strangely in that it sounds like you want to replace all the 1's with 0's, but your example seems to contradict that.
If you want to replace just 1's with 0's, you could do so like this:
lapply(test_list, function(x) {x[x==1] <- 0; return(x)})
[[1]]
ID col2 col3
1 a 0 0
2 b 2 0
3 c 3 2
[[2]]
ID col2 col3
1 d 4 0
2 e 5 0
3 f 6 2
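Note that x == 1 is evaluated over the whole data frame, including the ID column, so an ID stored as the string "1" would also be replaced. A possible variant that touches only numeric columns is sketched below; replace_ones is a hypothetical helper name, and the sketch assumes the ID column is not numeric.

```r
test_list <- list(
  data.frame(ID = letters[1:3], col2 = 1:3, col3 = 0:2),
  data.frame(ID = letters[4:6], col2 = 4:6, col3 = 0:2)
)

# Hypothetical helper: set 1 to 0 in numeric columns only
replace_ones <- function(x) {
  num <- vapply(x, is.numeric, logical(1))  # TRUE for numeric columns
  x[num] <- lapply(x[num], function(col) {
    col[which(col == 1)] <- 0               # which() skips NA values
    col
  })
  x
}

result <- lapply(test_list, replace_ones)
result[[1]]
```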
I have an ordered table, similar to the following:
df <- read.table(text =
"A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1",
header = TRUE)
In reality there will be many more columns, but this is fine for a solution.
I wish to sort this table first by SIZE (Ascending), then by each other column in priority sequence (Descending) - i.e. by column A first, then B, then C, etc.
The problem is that I will not know the column names in advance so cannot name them, but need in effect "all columns except SIZE".
End result should be:
A B C Size
1 0 0 1
0 1 0 1
0 0 1 1
1 1 0 2
0 1 1 2
I've seen examples of sorting by two columns, but I just can't find the correct syntax to sort by 'all other columns sequentially'.
Many thanks
If the column names are known, use order like this. No packages are needed.
o <- with(df, order(Size, -A, -B, -C))
df[o, ]
This gives:
A B C Size
1 1 0 0 1
5 0 1 0 1
3 0 0 1 1
4 1 1 0 2
2 0 1 1 2
or without the names just use column numbers:
o <- order(df[[4]], -df[[1]], -df[[2]], -df[[3]])
or
k <- 4
o <- do.call("order", data.frame(df[[k]], -df[-k]))
If Size is always the last column, use k <- ncol(df) instead; if it is not necessarily last but is always called Size, use k <- match("Size", names(df)) instead.
Note: Negation works in the example from the question because all of the columns are numeric. Non-numeric columns cannot be negated, so a more general solution replaces the first order() call above with the following, where xtfrm is an R function that converts objects to numeric values that sort in the expected order.
o <- with(df, order(Size, -xtfrm(A), -xtfrm(B), -xtfrm(C)))
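The do.call and xtfrm ideas above can be combined so that no column names other than Size are needed. This is a sketch rather than a definitive solution; keys and sorted are names introduced here.

```r
df <- read.table(header = TRUE, text = "A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1")

k <- match("Size", names(df))

# Sort keys: Size ascending, then every other column descending
keys <- c(list(df[[k]]), lapply(df[-k], function(col) -xtfrm(col)))

# unname() so that column names are never mistaken for arguments of order()
o <- do.call(order, unname(keys))
sorted <- df[o, ]
sorted
```

The remaining columns are used as tie-breakers in the order in which they appear in the data frame.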
We can use arrange from dplyr
library(dplyr)
arrange(df, Size, desc(A), desc(B), desc(C))
For a larger number of columns, arrange_ can be used (note that arrange_ is deprecated in current versions of dplyr):
cols <- paste0("desc(", names(df)[1:3], ")")
arrange_(df, .dots = c("Size", cols))
To make the question clear, let me start with a small example of my data frame.
ID <- c(rep("first", 2), rep("second", 4), rep("third",1), rep("fourth", 3))
Var_1 <- c(rep("A",2), rep("B", 2), rep("A",3), rep("B", 2), "A")
Var_2 <- c(rep("C",2), rep("D",3) , rep("C",2), rep("E",2), "D")
DF <- data.frame(ID, Var_1, Var_2)
> DF
ID Var_1 Var_2
1 first A C
2 first A C
3 second B D
4 second B D
5 second A D
6 second A C
7 third A C
8 fourth B E
9 fourth B E
10 fourth A D
There is one ID factor variable and two other factor variables: Var_1 with R=2 factor levels and Var_2 with C=3 factor levels.
I would like to get a new data frame with (RxC)+1=(2x3)+1 Variables with the frequencies of all combinations of factor levels - separately for each level in ID Variable, that would look like this:
ID A.C A.D A.E B.C B.D B.E
1 first 2 0 0 0 0 0
2 second 1 1 0 0 2 0
3 third 1 0 0 0 0 0
4 fourth 0 1 0 0 0 2
I tried a couple of functions, but the results were not even close to this, so they are not worth mentioning. In the original data frame I should get (6x9)+1=55 variables.
EDIT: There are solutions for counting factor levels for one or many variables separately, but I couldn't figure out how to get common counts for combinations of factor levels for two (or more) variables. Applying the solution to other variables seems easy now that I have the answers, but I could not get there by myself.
Using the dcast function from the reshape2 package (or data.table, which has an enhanced implementation of dcast):
library(reshape2)
dcast(DF, ID ~ paste(Var_1,Var_2,sep="."), fun.aggregate = length)
which gives:
ID A.C A.D B.D B.E
1 first 2 0 0 0
2 fourth 0 1 0 2
3 second 1 1 2 0
4 third 1 0 0 0
We could use paste to create a variable combining Var_1 and Var_2 and then produce a contingency table with ID and the new variable:
table(DF$ID,paste(DF$Var_1,DF$Var_2,sep="."))
output
A.C A.D B.D B.E
first 2 0 0 0
fourth 0 1 0 2
second 1 1 2 0
third 1 0 0 0
To order the table rows, we would first need to convert the ID column with factor(DF$ID,levels=c("first","second","third","fourth")).
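The outputs above omit the A.E and B.C columns because those combinations never occur in the data. If the empty combinations should appear as well, as in the desired output, one possible sketch uses interaction(), which builds a factor containing every combination of levels. The names id, combo, and tab are introduced here for illustration.

```r
ID <- c(rep("first", 2), rep("second", 4), rep("third", 1), rep("fourth", 3))
Var_1 <- c(rep("A", 2), rep("B", 2), rep("A", 3), rep("B", 2), "A")
Var_2 <- c(rep("C", 2), rep("D", 3), rep("C", 2), rep("E", 2), "D")
DF <- data.frame(ID, Var_1, Var_2)

# Keep the IDs in order of first appearance instead of alphabetical order
id <- factor(DF$ID, levels = unique(as.character(DF$ID)))

# All level combinations, ordered A.C, A.D, A.E, B.C, B.D, B.E
combo <- interaction(DF$Var_1, DF$Var_2, sep = ".", lex.order = TRUE)

# Contingency table of ID against the combined factor
tab <- table(id, combo)
as.data.frame.matrix(tab)
```

as.data.frame.matrix() converts the table to a data frame with one row per ID and one column per combination.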
Try
library(tidyr)
library(dplyr)
DF %>%
unite(Var, Var_1, Var_2, sep = ".") %>%
count(ID, Var) %>%
spread(Var, n, fill = 0)
Which gives:
#Source: local data frame [4 x 5]
#
# ID A.C A.D B.D B.E
# (fctr) (dbl) (dbl) (dbl) (dbl)
#1 first 2 0 0 0
#2 fourth 0 1 0 2
#3 second 1 1 2 0
#4 third 1 0 0 0