Related
Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename columns of df1 using df2, except column G since it not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on the answer from here, but it returns "E" "Q" "R" "Z" NA, as you see column G become NA. I hope it keep the original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
By using na.omit(it's little bit messy..)
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I have success to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is location of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7, which index does not exist in df1 that has length 5.
For df1, we should change order of terms in match, na.omit(match(df2$Col1,names(df1))) gives [1] 1 2 3 4
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This will works.
A solution using the rename_with function from the dplyr package.
library(dplyr)
df3 <- df2 %>%
filter(Col1 %in% names(df1))
df4 <- df1 %>%
rename_with(.cols = df3$Col1, .fn = function(x) df3$Col2[df3$Col1 %in% x])
df4
# E Q R Z G
# 1 1 2 3 4 5
I would like to ask how can I rearrange my dataset that fulfils the following
[
Original :
Group Value_y Value_z
1 m a
1 n a
2 o b
2 p b
Intended:
Group Value_a Value_b
1 m n
2 o p
]1
which involves separating value_y according to value_z and adding a new column according to the group number. Will potential need to average a separate column's values and add as a new column the same way.
Thank you!
In data.table we can use dcast :
library(data.table)
dcast(setDT(df), Group~rowid(Value_z), value.var = 'Value_y')
# Group 1 2
#1: 1 m n
#2: 2 o p
data
df <- structure(list(Group = c(1L, 1L, 2L, 2L), Value_y = c("m", "n",
"o", "p"), Value_z = c("a", "a", "b", "b")), class = "data.frame",
row.names = c(NA, -4L))
There is a dplyr solution. Define
Uneven = seq(1, dim(A)[1] - 1, by = 2)
Even = seq(2, dim(A)[1], by = 2)
with
A = data.frame(Group = c(1, 1, 2, 2), Value_y = c("m", "n", "o", "p"))
Then, you can use the pipe and some dplyr functionality to get
A2 = A %>%
dplyr::group_by(Group) %>%
dplyr::mutate(Row_1 = Value_y[Uneven]) %>%
dplyr::mutate(Row_2 = Value_y[Even]) %>%
dplyr::select(-Value_y) %>%
dplryr::slice(1)
and the output
> A2
# A tibble: 2 x 3
# Groups: Group [2]
Group Row_1 Row_2
<dbl> <fct> <fct>
1 1 m n
2 2 o p
Note that this solution presupposes two-pairs of Groups, i.e. an even number of observations.
I am trying to prep my data and I am stuck with one issue. Lets say I have the following data frame:
df1
Name C1 Val1
A a x1
A a x2
A b x3
A c x4
B d x5
B d x6
...
and I want to narrow down the df to
df2
Name C1 Val
A a,b,c x1+x2+x3+x4
B d x5+x6
...
while a is a character value and x is numeric value
I have been trying using sapply, rowsum and
df2<- aggregate(df1, list(df1[,1]), FUN= summary)
but it just can't put the character values in a list for each Name.
Can someone help me how to receive df2?
m <- function(x) if(is.numeric(x<- type.convert(x)))sum(x) else toString(unique(x))
aggregate(.~Name,df1,m)
Name C1 Val1
1 A a, b, c 10
2 B d 11
where
df1
Name C1 Val1
1 A a 1
2 A a 2
3 A b 3
4 A c 4
5 B d 5
6 B d 6
This is your df, I give it numbers 1 to 6 in Val1
df <-
structure(list(Name = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), C1 = structure(c(1L, 1L, 2L, 3L, 4L,
4L), .Label = c("a", "b", "c", "d"), class = "factor"), Val1 = 1:6), row.names = c(NA,
-6L), class = "data.frame")
We just use summarise:
df %>%
group_by(Name) %>%
summarise(C1=paste(unique(C1),collapse=","),Val1=sum(Val1))
# A tibble: 2 x 3
Name C1 Val1
<fct> <chr> <int>
1 A a,b,c 10
2 B d 11
Quick and easy dplyr solution:
library(dplyr)
library(stringr)
df1 %>%
mutate(Val1_num = as.numeric(str_extract(Val1, "\\d+"))) %>%
group_by(Name) %>%
summarise(C1 = paste(unique(C1), collapse = ","),
Val1 = paste(unique(Val1), collapse = "+"),
Val1_num = sum(Val1_num))
#> # A tibble: 2 x 4
#> Name C1 Val1 Val1_num
#> <chr> <chr> <chr> <dbl>
#> 1 A a,b,c x1+x2+x3+x4 10
#> 2 B d x5+x6 11
Or in base:
df2 <- aggregate(df1, list(df1[,1]), FUN = function(x) {
if (all(grepl("\\d", x))) {
sum(as.numeric(gsub("[^[:digit:]]", "", x)))
} else {
paste(unique(x), collapse = ",")
}
})
df2
#> Group.1 Name C1 Val1
#> 1 A A a,b,c 10
#> 2 B B d 11
data
df1 <- read.csv(text = "
Name,C1,Val1
A,a,x1
A,a,x2
A,b,x3
A,c,x4
B,d,x5
B,d,x6", stringsAsFactors = FALSE)
I have two data frames. dfOne is made like this:
X Y Z T J
3 4 5 6 1
1 2 3 4 1
5 1 2 5 1
and dfTwo is made like this
C.1 C.2
X Z
Y T
I want to obtain a new dataframe where there are simultaneously X, Y, Z, T Values which are major than a specific threshold.
Example. I need simultaneously (in the same row):
X, Y > 2
Z, T > 4
I need to use the second data frame to reach my objective, I expect something like:
dfTwo$C.1>2
so the result would be a new dataframe with this structure:
X Y Z T J
3 4 5 6 1
How could I do it?
Here is a base R method with Map and Reduce.
# build lookup table of thresholds relative to variable name
vals <- setNames(c(2, 2, 4, 4), unlist(dat2))
# subset data.frame
dat[Reduce("&", Map(">", dat[names(vals)], vals)), ]
X Y Z T J
1 3 4 5 6 1
Here, Map returns a list of length 4 with logical variables corresponding to each comparison. This list is passed to Reduce which returns a single logical vector with length corresponding to the number of rows in the data.frame, dat. This logical vector is used to subset dat.
data
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
dat2 <-
structure(list(C.1 = structure(1:2, .Label = c("X", "Y"), class = "factor"),
C.2 = structure(c(2L, 1L), .Label = c("T", "Z"), class = "factor")), .Names = c("C.1",
"C.2"), class = "data.frame", row.names = c(NA, -2L))
We can use the purrr package
Here is the input data.
# Data frame from lmo's solution
dat <-
structure(list(X = c(3L, 1L, 5L), Y = c(4L, 2L, 1L), Z = c(5L,
3L, 2L), T = c(6L, 4L, 5L), J = c(1L, 1L, 1L)), .Names = c("X",
"Y", "Z", "T", "J"), class = "data.frame", row.names = c(NA,
-3L))
# A numeric vector to show the threshold values
# Notice that columns without any requirements need NA
vals <- c(X = 2, Y = 2, Z = 4, T = 4, J = NA)
Here is the implementation
library(purrr)
map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) %>% na.omit()
# A tibble: 1 x 5
X Y Z T J
<int> <int> <int> <int> <int>
1 3 4 5 6 1
map2_dfc loop through each column in dat and each value in vals one by one with a defined function. ~ifelse(.x > .y | is.na(.y), .x, NA) means if the number in each column is larger than the corresponding value in vals, or vals is NA, the output should be the original value from the column. Otherwise, the value is replaced to be NA. The output of map2_dfc(dat, vals, ~ifelse(.x > .y | is.na(.y), .x, NA)) is a data frame with NA values in some rows indicating that the condition is not met. Finally, na.omit removes those rows.
Update
Here I demonstrate how to covert the dfTwo dataframe to the vals vector in my example.
First, let's create the dfTwo data frame.
dfTwo <- read.table(text = "C.1 C.2
X Z
Y T",
header = TRUE, stringsAsFactors = FALSE)
dfTwo
C.1 C.2
1 X Z
2 Y T
To complete the task, I load the dplyr and tidyr package.
library(dplyr)
library(tidyr)
Now I begin the transformation of dfTwo. The first step is to use stack function to convert the format.
dfTwo2 <- dfTwo %>%
stack() %>%
setNames(c("Col", "Group")) %>%
mutate(Group = as.character(Group))
dfTwo2
Col Group
1 X C.1
2 Y C.1
3 Z C.2
4 T C.2
The second step is to add the threshold information. One way to do this is to create a look-up table showing the association between Group and Value
threshold_df <- data.frame(Group = c("C.1", "C.2"),
Value = c(2, 4),
stringsAsFactors = FALSE)
threshold_df
Group Value
1 C.1 2
2 C.2 4
And then we can use the left_join function to combine the data frame.
dfTwo3 <- dfTwo2 %>% left_join(threshold_dt, by = "Group")
dfTwo3
Col Group Value
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
Now it is the third step. Notice that there is a column called J which does not need any threshold. So we need to add this information to dfTwo3. We can use the complete function from tidyr. The following code completes the data frame by adding Col in dat but not in dfTwo3 and NA to the Value.
dfTwo4 <- dfTwo3 %>% complete(Col = colnames(dat))
dfTwo4
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 J <NA> NA
2 T C.2 4
3 X C.1 2
4 Y C.1 2
5 Z C.2 4
The fourth step is arrange the right order of dfTwo4. We can achieve this by turning Col to factor and assign the level based on the order of the column name in dat.
dfTwo5 <- dfTwo4 %>%
mutate(Col = factor(Col, levels = colnames(dat))) %>%
arrange(Col) %>%
mutate(Col = as.character(Col))
dfTwo5
# A tibble: 5 x 3
Col Group Value
<chr> <chr> <dbl>
1 X C.1 2
2 Y C.1 2
3 Z C.2 4
4 T C.2 4
5 J <NA> NA
We are almost there. Now we can create vals from dfTwo5.
vals <- dfTwo5$Value
names(vals) <- dfTwo5$Col
vals
X Y Z T J
2 2 4 4 NA
Now we are ready to use the purrr package to filter the data.
The aboved are the breakdown of steps. We can combine all these steps into the following code for simlicity.
library(dplyr)
library(tidyr)
threshold_df <- data.frame(Group = c("C.1", "C.2"),
Value = c(2, 4),
stringsAsFactors = FALSE)
dfTwo2 <- dfTwo %>%
stack() %>%
setNames(c("Col", "Group")) %>%
mutate(Group = as.character(Group)) %>%
left_join(threshold_df, by = "Group") %>%
complete(Col = colnames(dat)) %>%
mutate(Col = factor(Col, levels = colnames(dat))) %>%
arrange(Col) %>%
mutate(Col = as.character(Col))
vals <- dfTwo2$Value
names(vals) <- dfTwo2$Col
dfOne[Reduce(intersect, list(which(dfOne["X"] > 2),
which(dfOne["Y"] > 2),
which(dfOne["Z"] > 4),
which(dfOne["T"] > 4))),]
# X Y Z T J
#1 3 4 5 6 1
Or iteratively (so fewer inequalities are tested):
vals = c(X = 2, Y = 2, Z = 4, T = 4) # from #lmo's answer
dfOne[Reduce(intersect, lapply(names(vals), function(x) which(dfOne[x] > vals[x]))),]
# X Y Z T J
#1 3 4 5 6 1
I'm writing this assuming that the second DF is meant to categorize the fields in the first DF. It's way simpler if you don't need to use the second one to define the conditions:
dfNew = dfOne[dfOne$X > 2 & dfOne$Y > 2 & dfOne$Z > 4 & dfOne$T > 4, ]
Or, using dplyr:
library(dplyr)
dfNew = dfOne %>% filter(X > 2 & Y > 2 & Z > 4 & T > 4)
In case that's all you need, I'll save this comment while I poke at the more complicated version of the question.
I have a dataframe looking like this:
id value1 value2 value3 value4
A 14 24 22 9
B 51 25 29 33
C 4 16 8 10
D 1 4 2 4
Now I want to compare each column of the row with the others rows in order to identify the rows where every value is higher.
So, for example for id D this would be A, B and C.
For C it would be B, for A it's B and for B there is no row.
I tried to do that by looping through the rows and comparing every column, but that takes a lot of time. The original dataset has about 5000 rows and 20 columns to compare.
I am sure that there is a way to do that more efficiently. Thanks for your help!
I don't know a simple function to do this task.
Here is how I would do.
library(dplyr)
DF <- data.frame(
id = c("A", "B", "C", "D"),
value1 = c(14, 51, 4, 1),
value2 = c(24, 25, 16, 4),
value3 = c(22, 29, 8, 2),
value4 = c(9, 33, 10, 4),
stringsAsFactors = FALSE)
# get the order for each value
tmp <- lapply(select(DF, -id), function(x) DF$id[order(x)])
# find a set of "biggers" for each id
tmp <- lapply(tmp, function(x) data.frame(
id = rep(x, rev(seq_along(x))-1),
bigger = x[lapply(seq_along(x), function(i)
which(seq_along(x) > i)) %>% unlist()],
stringsAsFactors = FALSE))
# inner_join all, this keeps "biggers" in all columns
out <- NULL
for (v in tmp) {
if (is.null(out)) {
out <- v
} else {
out <- inner_join(out, v, by = c("id", "bigger"))
}
}
This gets you:
out
# id bigger
#1 D C
#2 D A
#3 D B
#4 C B
#5 A B
Here's an approach that returns results in a data frame format.
library(tidyr)
library(dplyr)
# reshape data to long format
td <- d %>% gather(key, value, value1:value4)
# create a copy w/ different names for merging
td2 <- td %>% select(id2 = id, key, value2 = value)
# full outer join to produce one row per pair of IDs
dd <- merge(td, td2, by = "key", all = TRUE)
# the result
dd %>%
filter(id != id2) %>%
group_by(id, id2) %>%
summarise(all_less = !any(value >= value2)) %>%
filter(all_less)
results (id is less than id2)
id id2 all_less
(fctr) (fctr) (lgl)
1 A B TRUE
2 C B TRUE
3 D A TRUE
4 D B TRUE
5 D C TRUE
data
d <- structure(list(
id = structure(1:4, .Label = c("A", "B", "C", "D"), class = "factor"),
value1 = c(14L, 51L, 4L, 1L),
value2 = c(24L, 25L, 16L, 4L),
value3 = c(22L, 29L, 8L, 2L), value4 = c(9L, 33L, 10L, 4L)
),
.Names = c("id", "value1", "value2", "value3", "value4"),
class = "data.frame", row.names = c(NA, -4L)
)
I think this works just fine:
ind <- which(names(df) == "id")
apply(df[,-ind],1,function(x) df$id[!rowSums(!t(x < t(df[,-ind])))] )
# [[1]]
# [1] "B"
#
# [[2]]
# character(0)
#
# [[3]]
# [1] "B"
#
# [[4]]
# [1] "A" "B" "C"