In an experiment, people had four candidates to choose from; sometimes the candidates were male, other times female. In the data frame below, C1 means Candidate 1, C2 means Candidate 2, and so on; F denotes female and M denotes male. A response of 1 indicates the person chose C1, a response of 2 indicates they chose C2, and so on.
C1 C2 C3 C4 response
F F M M 2
M M F M 1
I want a new column "ChooseFemale" which equals 1 if the person chose a female candidate, and zero otherwise. So the first row should have ChooseFemale equal to 1, while the second row should have it equal to zero.
This requires looking up a different column depending on the value of the "response" column.
How can I do this?
A base R solution using matrix indexing:
x <- df[["response"]]
# index df with (row, column) pairs, then test each chosen value for "F"
df$ChooseFemale <- as.integer(df[cbind(seq_along(x), x)] == "F")
C1 C2 C3 C4 response ChooseFemale
1 F F M M 2 1
2 M M F M 1 0
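The key here is matrix indexing: subsetting a data frame (or matrix) with a two-column matrix of (row, column) pairs extracts one element per pair, so cbind(seq_along(x), x) pulls each row's chosen candidate in a single vectorized step. A minimal illustration of the mechanism on a toy matrix (not the question's data):
m <- matrix(letters[1:6], nrow = 2)  # 2 x 3 matrix
m[cbind(c(1, 2), c(3, 1))]           # elements (1,3) and (2,1)
# [1] "e" "b"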
Data:
Lines <- "C1 C2 C3 C4 response
F F M M 2
M M F M 1"
df <- read.table(text = Lines, header = TRUE, stringsAsFactors = FALSE)
# create dataframe
my.df <- data.frame(c1 = c('f', 'm'),
                    c2 = c('f', 'm'),
                    c3 = c('m', 'f'),
                    c4 = c('m', 'm'),
                    resp = c(2, 1))

# add column
my.df$ChooseFemale <- NA

# loop over rows
for (row in 1:nrow(my.df)) {
  # extract the column to check from the response column
  col <- paste0('c', my.df$resp[row])
  # fill in the new column
  my.df$ChooseFemale[row] <- ifelse(my.df[row, col] == 'f', 1, 0)
}
apply(df, 1, function(x) ifelse(x[as.numeric(x['response'])] == 'F', 1, 0))
[1] 1 0
Here is the basic idea: select the column using the value in the response column, then use apply() with MARGIN=1 to apply this function row by row. (Note that apply() converts each row to a character vector, so the response value has to be converted back to a number with as.numeric().)
df[1,'response']
[1] 2
df[1,df[1,'response']]
[1] F
Levels: F M
Data:
df <- read.table(text = "
C1 C2 C3 C4 response
F F M M 2
M M F M 1
", header = TRUE)
You can create a simple function to check whether the response number matches "F", and then apply it to each row at once.
A tidyverse approach:
library(tidyverse)
# note: without set.seed(), sample() will give different data on each run
mydata <- data.frame(C1 = sample(c("F", "M"), 10, replace = TRUE),
                     C2 = sample(c("F", "M"), 10, replace = TRUE),
                     C3 = sample(c("F", "M"), 10, replace = TRUE),
                     C4 = sample(c("F", "M"), 10, replace = TRUE),
                     response = sample(1:4, 10, replace = TRUE),
                     stringsAsFactors = FALSE)
C1 C2 C3 C4 response
1 M M M M 1
2 F F F M 4
3 M F M M 2
4 F M M F 2
5 M M M F 1
6 M F M F 4
7 M M M F 3
8 M M M M 2
9 M F M M 3
10 F F M F 4
Custom function to check if the response matches "F":
female_choice <- function(C1, C2, C3, C4, response) {
  c(C1, C2, C3, C4)[response] == "F"
}
And then just use mutate() to modify your data frame, and pmap_lgl() to use its rows, one by one, as the set of arguments for female_choice() (wrap the result in as.integer() if you need 1/0 instead of TRUE/FALSE):
mydata %>%
  mutate(ChooseFemale = pmap_lgl(., female_choice))
C1 C2 C3 C4 response ChooseFemale
1 M M M M 1 FALSE
2 F F F M 4 FALSE
3 M F M M 2 TRUE
4 F M M F 2 FALSE
5 M M M F 1 FALSE
6 M F M F 4 TRUE
7 M M M F 3 FALSE
8 M M M M 2 FALSE
9 M F M M 3 FALSE
10 F F M F 4 TRUE
Here is one way to do it using tidyverse packages. As specified in the question, this takes into account both which candidate was chosen (C1-C4) and the sex of the candidate (F/M):
# loading needed libraries
library(tidyverse)
# data
df <- utils::read.table(text = "C1 C2 C3 C4 response
F F M M 2
M M F M 1", header = TRUE) %>%
  tibble::as_tibble(.) %>%
  tibble::rowid_to_column(.)
# manipulation
dplyr::full_join(
  # creating a dataframe with the new chooseFemale variable
  x = df %>%
    tidyr::gather(
      data = .,
      key = "candidate",
      value = "choice",
      C1:C4
    ) %>%
    dplyr::mutate(choice_new = paste("C", response, sep = "")) %>%
    # creating the needed column by checking both the candidate chosen and
    # the sex of the candidate
    dplyr::mutate(chooseFemale = dplyr::case_when(
      (choice_new == candidate) & (choice == "F") ~ 1,
      (choice_new == candidate) & (choice == "M") ~ 0
    )) %>%
    dplyr::select(.data = ., -choice_new) %>%
    tidyr::spread(data = ., key = candidate, value = choice) %>%
    dplyr::filter(.data = ., !is.na(chooseFemale)) %>%
    dplyr::select(.data = ., -c(C1:C4)),
  # original dataframe
  y = df,
  by = c("rowid", "response")
) %>%
  # removing the redundant row id
  dplyr::select(.data = ., -rowid) %>%
  # rearranging the columns
  dplyr::select(.data = ., C1:C4, response, chooseFemale)
#> # A tibble: 2 x 6
#> C1 C2 C3 C4 response chooseFemale
#> <fct> <fct> <fct> <fct> <int> <dbl>
#> 1 F F M M 2 1
#> 2 M M F M 1 0
Created on 2018-08-24 by the reprex package (v0.2.0.9000).
I'll provide an answer using tidyr. Your data is in a "wide" format. This makes it very human readable, but not necessarily machine readable. The first step toward making it tidy is to convert the data to long format; in other words, let's transform the data so that we don't have to do calculations across multiple columns in a single row.
Tidy format allows you to use grouping variables, create summaries, and so on.
library(dplyr)
library(tidyr)
df <- data.frame(C1 = c("F","M"),
                 C2 = c("F","M"),
                 C3 = c("M","F"),
                 C4 = c("M","M"),
                 stringsAsFactors = FALSE)
> df
C1 C2 C3 C4
1 F F M M
2 M M F M
Let's add an "id" field so we can keep track of each unique row. This is the same as the row number, but we are about to convert the wide data to long data, which has different row numbers. Then use gather() to convert from wide to long:
df_long <- df %>%
  mutate(id = row_number(C1)) %>%
  gather(key = "key", value = "value", C1:C4)
> df_long
id key value
1 1 C1 F
2 2 C1 M
3 1 C2 F
4 2 C2 M
5 1 C3 M
6 2 C3 F
7 1 C4 M
8 2 C4 M
Now it is possible to use group_by() to group based on variables, perform summaries, etc.
For what you've asked, group by the id column and then perform calculations on the group. In this case we take the sum of all values that are "F". Then we ungroup and spread back to the wide, human-readable format.
df_long <- df_long %>%
  group_by(id) %>%
  mutate(response = sum(value == "F", na.rm = TRUE)) %>%
  ungroup()
> df_long
# A tibble: 8 x 4
id key value response
<int> <chr> <chr> <int>
1 1 C1 F 2
2 2 C1 M 1
3 1 C2 F 2
4 2 C2 M 1
5 1 C3 M 2
6 2 C3 F 1
7 1 C4 M 2
8 2 C4 M 1
To get the data back in wide format once you are done doing all calculations that you need in long format:
df <- df_long %>%
  spread(key, value)
> df
# A tibble: 2 x 6
id response C1 C2 C3 C4
<int> <int> <chr> <chr> <chr> <chr>
1 1 2 F F M M
2 2 1 M M F M
To get the data back in the order you had it:
df <- df %>%
  select(-id) %>%
  select(C1:C4, everything())
> df
# A tibble: 2 x 5
C1 C2 C3 C4 response
<chr> <chr> <chr> <chr> <int>
1 F F M M 2
2 M M F M 1
You can of course use the pipes to do this all in one step.
df <- df %>%
  mutate(id = row_number(C1)) %>%
  gather(key = "key", value = "value", C1:C4) %>%
  group_by(id) %>%
  mutate(response = sum(value == "F", na.rm = TRUE)) %>%
  ungroup() %>%
  spread(key, value) %>%
  select(-id) %>%
  select(C1:C4, everything())
I need a function f(B,A) that, given a dataset with the following structure,
T1 T2 T3 T4 T5 ... P1 P2 P3 P4 P5 ...
1 2 5 8 9 ... A C B B A ...
1 3 4 6 6 ... C A C A B ...
finds the first occurrence of B and of A in the Pj columns (starting with j=1) and returns the difference between the values in the corresponding Ti columns.
For instance:
in line 1: B appears in P3 first, A appears in P1 first. Then:
f(B, A) = T3 - T1 = 5-1 = 4
in line 2: B appears in P5 first, A appears in P2 first. Then:
f(B, A) = T5 - T2 = 6-3 = 3
I can find in which Pj columns B and A appear using str_detect(), but I don't know how to "move" from P_j1, P_j2 to T_j1, T_j2.
Using data.table syntax (or base R) would be appreciated.
Here is a data.table approach.
library(data.table)
DT <- fread("T1 T2 T3 T4 T5 P1 P2 P3 P4 P5
1 2 5 8 9 A C B B A
1 3 4 6 6 C A C A B")
# add row IDs
DT[, id := .I]

# melt to a long format
# (the names given to patterns() become the value column names)
DT.melt <- data.table::melt(DT,
                            id.vars = "id",
                            measure.vars = patterns(T = "^T", P = "^P"))
# Find first B for each id
val1 <- DT.melt[P == "B", T[1], by = .(id)]$V1
# [1] 5 6
# Find first A for each id
val2 <- DT.melt[P == "A", T[1], by = .(id)]$V1
# [1] 1 3
val1 - val2
# [1] 4 3
base R
# assumes df holds the same data as a plain data.frame,
# e.g. df <- as.data.frame(DT)
f <- function(l1, l2){
  apply(df, 1, function(x){
    dfP <- x[grepl("P", names(x))]   # the P columns of this row
    dfT <- x[grepl("T", names(x))]   # the T columns of this row
    as.numeric(dfT[which(dfP == l1)[1]]) - as.numeric(dfT[which(dfP == l2)[1]])
  })
}
f("B", "A")
[1] 4 3
Tidyverse
With this type of data, it's usually best to pivot to long and then back to wide: here is a tidyverse solution, with diff being the desired output.
library(tidyverse)
df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_pattern = "(\\D)(\\d)",
               names_to = c(".value", "group")) %>%
  group_by(id) %>%
  mutate(diff = first(T[P == "B"]) - first(T[P == "A"])) %>%
  pivot_wider(c(id, diff), names_from = group, values_from = c(T, P), names_sep = "")
output
id diff T1 T2 T3 T4 T5 P1 P2 P3 P4 P5
<int> <int> <int> <int> <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
1 1 4 1 2 5 8 9 A C B B A
2 2 3 1 3 4 6 6 C A C A B
Raw data:
V1 V2
1 c1 a
2 c2 b
3 <NA> c
4 <NA> d
5 c3 e
6 <NA> f
7 c4 g
Reproducible Sample Data
V1 <- c('c1','c2',NA,NA,'c3',NA,'c4')
V2 <- c('a','b','c','d','e','f','g')
df <- data.frame(V1, V2)
Expected output
V1_after V2_after
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
V1_after <- c('c1','c2','c3','c4')
V2_after <- c('a',paste('b','c','d'),paste('e','f'),'g')
data.frame(V1_after,V2_after)
This is only sample data; in the real data, the rows where V1 is NA do not follow a regular pattern.
This is too difficult for me.
You could make use of zoo::na.locf for this. It takes the most recent non-NA value and fills all NA values along the way:
library(dplyr)
library(zoo)
df %>%
  mutate(V1 = zoo::na.locf(V1)) %>%
  group_by(V1) %>%
  summarise(V2 = paste0(V2, collapse = " "))
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
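To see what na.locf() ("last observation carried forward") does on its own, here is a quick standalone illustration using the V1 values from the question:
zoo::na.locf(c('c1','c2',NA,NA,'c3',NA,'c4'))
# [1] "c1" "c2" "c2" "c2" "c3" "c3" "c4"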
A base R option using na.omit + cumsum + aggregate
aggregate(
  V2 ~ .,
  transform(
    df,
    V1 = na.omit(V1)[cumsum(!is.na(V1))]
  ),
  c
)
gives
V1 V2
1 c1 a
2 c2 b, c, d
3 c3 e, f
4 c4 g
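The trick is that cumsum(!is.na(V1)) builds a run id that increments at each non-NA entry, and that id then indexes into the non-NA values. The intermediates, using the sample data from above:
V1 <- c('c1','c2',NA,NA,'c3',NA,'c4')
cumsum(!is.na(V1))
# [1] 1 2 2 2 3 3 4
na.omit(V1)[cumsum(!is.na(V1))]
# [1] "c1" "c2" "c2" "c2" "c3" "c3" "c4"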
You can fill the NAs with the previous non-NA value and then summarise the data.
library(dplyr)
library(tidyr)
df %>%
  fill(V1) %>%
  group_by(V1) %>%
  summarise(V2 = paste(V2, collapse = ' '))
# V1 V2
# <chr> <chr>
#1 c1 a
#2 c2 b c d
#3 c3 e f
#4 c4 g
I have a column with multiple values inside it, like this:
ColumnX1
A,D,C,B,F,E,G
F,A,B,E,G,C
C,D,G,F,A,T
I split the data with
Species_Data2 <- data.frame(str_split_fixed(Species_Data$Other.Anopheline.species, ",", 21))
But I got the values as below, i.e. a data frame like:
X1 X2 X3 X4 X5 X6 X7
A D C B F E G
F A B E G NA C
C D G F A T NA
I want to make a data frame like:
X1 X2 X3 X4 X5 X6 X7 X8
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
and then I want to use the letter values as the column names:
Colnames
'A' 'B' 'C' 'D' 'E' 'F' 'G' 'T'
A B C D E F G NA
A B C NA E F G NA
A NA C D NA F G T
I tried sorting, but it does not work that well; it comes up with 0 values.
If I understand correctly, the OP wants to rearrange the data so that there is a separate column for each letter. If a letter is present in a row, then the letter appears in the appropriate column/row of the reshaped data. NA indicates that a letter is missing in a row. In addition, the letter columns should be arranged in alphabetical order.
1. dplyr/tidyr approach
If we start with the data.frame resulting from OP's call to stringr::str_split_fixed(), we need to reshape the split data from wide to long format, remove empty entries, order rows so that columns appear in letter order, and reshape to wide format again. For reshaping, a row id is required. To achieve the desired output, pivot_wider() has to be called with the names_from = value parameter:
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF$ColumnX1, ",", 21)) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn) %>%
  filter(value != "") %>%
  arrange(as.character(value)) %>%
  pivot_wider(rn, names_from = value)
rn A B C D E F G T
<int> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 1 A B C D E F G NA
2 2 A B C NA E F G NA
3 3 A NA C D NA F G T
2. data.table approach
If we start from the unsplit original data, there is a much more concise variant which uses data.table's dcast() for reshaping:
library(data.table)
setDT(DF)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
If required, the additional row id column can be removed in both approaches.
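For instance, a minimal sketch (result and result_dt are just hypothetical names for the reshaped outputs of the two pipelines above):
# tidyverse: append select(-rn) to the pipeline
result <- as.data.frame(stringr::str_split_fixed(DF$ColumnX1, ",", 21)) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn) %>%
  filter(value != "") %>%
  arrange(as.character(value)) %>%
  pivot_wider(rn, names_from = value) %>%
  select(-rn)

# data.table: delete the nrow column by reference
result_dt <- setDT(DF)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF)][, dcast(.SD, nrow ~ V1)]
result_dt[, nrow := NULL]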
Data
DF <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G",
                              "F,A,B,E,G,C",
                              "C,D,G,F,A,T"))
EDIT: Duplicate values
In a comment, the OP has disclosed that the production dataset contains duplicate values.
In case of duplicate values, dcast() uses the length() function by default to aggregate the data.
With a modified dataset DF2 which contains duplicate values in rows 1 and 2, the original data.table approach returns:
library(data.table)
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][, dcast(.SD, nrow ~ V1)]
nrow A B C D E F G T
1: 1 1 1 2 1 1 1 1 0
2: 2 1 1 1 0 1 2 1 0
3: 3 1 0 1 1 0 1 1 1
Here, the number of duplicate letters is shown.
The expected behaviour can be restored by removing the duplicate values before reshaping by using unique():
setDT(DF2)[, stringr::str_split(ColumnX1, ","), by = 1:nrow(DF2)][
  , dcast(unique(.SD), nrow ~ V1)]
nrow A B C D E F G T
1: 1 A B C D E F G <NA>
2: 2 A B C <NA> E F G <NA>
3: 3 A <NA> C D <NA> F G T
Also the dplyr/tidyr approach needs to be modified by specifying an appropriate aggregation function in the call to pivot_wider():
library(dplyr)
library(tidyr)
as.data.frame(stringr::str_split_fixed(DF2$ColumnX1, ",", 21)) %>%
  mutate(rn = row_number()) %>%
  pivot_longer(-rn) %>%
  filter(value != "") %>%
  arrange(as.character(value)) %>%
  pivot_wider(rn, names_from = value, values_fn = list(value = unique))
Data with duplicate values
DF2 <- data.frame(ColumnX1 = c("A,D,C,B,F,E,G,C",
                               "F,A,B,E,G,C,F",
                               "C,D,G,F,A,T"))
A data frame is given and the objective is to calculate the direct dependency value between two columns of the data frame.
c1 c2 N
a b 30
a c 5
a d 10
c a 5
b a 10
What we are looking for is the direct dependency relation; for example, for a and b this value is ab - ba = 30 - 10 = 20.
The final result should be like this:
c1 c2 N DepValue
a b 30 ab - ba = 20
a c 5 ac - ca = 0
a d 10 ad - 0 = 10
c a 5 ca - ac = 0
b a 10 ba - ab = 20
Thank you for your help.
D <- read.table(header=TRUE, stringsAsFactors = FALSE, text=
"c1 c2 N
a b 30
a c 5
a d 10
c a 5
b a 10")
N12 <- D$N
names(N12) <- paste0(D$c1, D$c2)  # lookup table: "ab" -> 30, "ac" -> 5, ...
N21 <- N12[paste0(D$c2, D$c1)]    # value of the reversed pair; NA when absent
D$depValue <- D$N - ifelse(is.na(N21), 0, N21)
Result (note that row 5 comes out as -20, since ba - ab = 10 - 30; the expected output in the question shows 20, which appears to be a sign error):
> D
c1 c2 N depValue
1 a b 30 20
2 a c 5 0
3 a d 10 10
4 c a 5 0
5 b a 10 -20
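The technique here is a named-vector lookup table: indexing a named vector with a name that doesn't exist returns NA, which the ifelse() then treats as 0. A quick standalone illustration:
x <- c(ab = 30, ac = 5)
x[c("ba", "ac")]
# <NA>   ac
#   NA    5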
One option is to create groups with the pmin and pmax values of c1 and c2 and take the difference between the two N values. This returns NA for groups with only one row; we can replace those NAs with the first value in the group.
library(dplyr)
df %>%
  group_by(group1 = pmin(c1, c2), group2 = pmax(c1, c2)) %>%
  mutate(dep = N[1] - N[2],
         dep = replace(dep, is.na(dep), N[1])) %>%
  ungroup() %>%
  select(-group1, -group2)
# c1 c2 N dep
# <chr> <chr> <int> <int>
#1 a b 30 20
#2 a c 5 0
#3 a d 10 10
#4 c a 5 0
#5 b a 10 20
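Note that pmin() and pmax() also work elementwise on character vectors (comparing alphabetically), which is what makes the (group1, group2) key identical for both orderings of a pair:
pmin(c("a", "a", "c", "b"), c("b", "c", "a", "a"))
# [1] "a" "a" "a" "a"
pmax(c("a", "a", "c", "b"), c("b", "c", "a", "a"))
# [1] "b" "c" "c" "b"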
An idea via base R is to sort the values of c1 and c2 pairwise, split based on those keys and subtract N, i.e.
i1 <- paste(pmin(df$c1, df$c2), pmax(df$c1, df$c2))
i1
#[1] "a b" "a c" "a d" "a c" "a b"
do.call(rbind, lapply(split(df, i1), function(i) {i['DepValue'] <- Reduce(`-`, i$N); i}))
# c1 c2 N DepValue
#a b.1 a b 30 20
#a b.5 b a 10 20
#a c.2 a c 5 0
#a c.4 c a 5 0
#a d a d 10 10
I have the use-case shown below. Basically I have a data frame with three columns. I want to group by two columns (c1, c2) and sum the third one, c3. Then, for each c1, I want to pick only the row with the maximum c3 (among all c2); full sorting is unnecessary since I'm only interested in the max.
library(plyr)
df <- data.frame(c1=c('a','a','a','b','b','c'),c2=c('x','y','y','x','y','x'),c3=c(1,2,3,4,5,6))
df
c1 c2 c3
1 a x 1
2 a y 2
3 a y 3
4 b x 4
5 b y 5
6 c x 6
sel <- plyr::ddply(df, c('c1','c2'), plyr::summarize,c3=sum(c3))
sel[with(sel, order(c1,-c3)),]
c1 c2 c3
2 a y 5 <<< highest c3 within this c1 group
1 a x 1
4 b y 5 <<< highest c3 within this c1 group
3 b x 4
5 c x 6 <<< highest c3 within this c1 group
I could do this in a loop, but I'm wondering how it can be done in a vectorized fashion or with a higher-level function.
Here's a base R approach:
df2 <- aggregate(c3~c1+c2, df, sum)
subset(df2[order(-df2$c3),], !duplicated(c1))
# c1 c2 c3
#3 c x 6
#4 a y 5
#5 b y 5
Another solution using dplyr:
library(dplyr)
df2 <- df %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3)) %>%
  filter(c3 == max(c3))
df2
# A tibble: 3 x 3
# Groups: c1 [3]
c1 c2 c3
<fctr> <fctr> <dbl>
1 a y 5
2 b y 5
3 c x 6
Here is another option with data.table
library(data.table)
setDT(df)[, .(c3 = sum(c3)) , .(c1, c2)][, .SD[which.max(c3)], .(c1)]
# c1 c2 c3
#1: a y 5
#2: b y 5
#3: c x 6
Using dplyr:
df %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3)) %>%
  top_n(1, c3)
Or the last line can be slice(which.max(c3)), which will guarantee a single row.
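For reference, a sketch of that variant (same data as above); unlike top_n(), which keeps every tied row, which.max() returns only the first maximum in each group:
df %>%
  group_by(c1, c2) %>%
  summarise(c3 = sum(c3)) %>%
  slice(which.max(c3))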