I know this is not a big problem, but I'm just new to this. I have this output obtained from merging two data frames. Each one has a column that holds the sex of each participant in an event.
Sex.x  Sex.y
M      M
F      F
F      M
M      M
F      F
M      M
NA     M
F      F
Desired output: the two columns combined into one that has "?" when the two values don't match and that keeps the single value when the adjacent cell is NA.
F_Sex
M
F
?
M
F
M
M
F
I was trying to do it with the dplyr package, but I only got as far as this code. I know I need to use if_else, but after many tries I have nothing.
all_data1 <- all_data %>% unite(F_sexo, c(sexo.x, sexo.y), sep = "-", remove = TRUE)
Thanks a lot in advance.
Here is one idea. Use coalesce first so that rows with exactly one NA get the non-missing sex. Then use ifelse to change the rows where the two sexes differ to ?.
Notice that if both columns are NA in a row, this solution returns NA. Please make sure this is the behavior you want.
library(dplyr)
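dat2 <- dat %>%
  # take Sex.x, falling back to Sex.y where Sex.x is NA
  mutate(Sex = coalesce(Sex.x, Sex.y)) %>%
  # flag rows where both values are present but disagree
  mutate(Sex = ifelse(Sex.x != Sex.y & !is.na(Sex.x) & !is.na(Sex.y), "?", Sex))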
dat2
# Sex.x Sex.y Sex
# 1 M M M
# 2 F F F
# 3 F M ?
# 4 M M M
# 5 F F F
# 6 M M M
# 7 <NA> M M
# 8 F F F
DATA
dat <- read.table(text = "Sex.x Sex.y
M M
F F
F M
M M
F F
M M
NA M
F F", header = TRUE)
Check this solution. The data is assigned as df.
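df %>% mutate(F_sex = case_when(Sex.x == Sex.y ~ Sex.x,
                                TRUE ~ "?"))
or, to keep the non-missing value when one of the two columns is NA (the first version returns "?" for those rows):
df %>% mutate(F_sex = case_when(is.na(Sex.x) ~ Sex.y,
                                is.na(Sex.y) ~ Sex.x,
                                Sex.x == Sex.y ~ Sex.x,
                                TRUE ~ "?"))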
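If you would rather avoid dplyr, the same rule can be sketched in base R with nested ifelse calls. This is only a sketch: it assumes the two columns are character (hence the explicit stringsAsFactors = FALSE), and it returns NA when both values are NA.

```r
dat <- read.table(text = "Sex.x Sex.y
M M
F F
F M
M M
F F
M M
NA M
F F", header = TRUE, stringsAsFactors = FALSE)

# Outer ifelse: if Sex.x is missing, fall back to Sex.y.
# Inner ifelse: keep Sex.x when Sex.y is missing or both agree, else "?".
dat$F_Sex <- ifelse(
  is.na(dat$Sex.x), dat$Sex.y,
  ifelse(is.na(dat$Sex.y) | dat$Sex.x == dat$Sex.y, dat$Sex.x, "?")
)
dat$F_Sex
```

On the sample data this gives the column from the question (M, F, ?, M, F, M, M, F).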
Here is my simplified df:
GP_A <- c(rep("a",3),rep("b",2),rep("c",2))
GP_B <- c(rep("d",2),rep("e",4),rep("f",1))
GENDER <- c(rep("M",4),rep("F",3))
LOC <- c(rep("HK",2),rep("UK",3),rep("JP",2))
SCORE <- c(50,70,80,20,30,80,90)
df <- as.data.frame(cbind(GP_A,GP_B,GENDER,LOC,SCORE))
> df
GP_A GP_B GENDER LOC SCORE
1 a d M HK 50
2 a d M HK 70
3 a e M UK 80
4 b e M UK 20
5 b e F UK 30
6 c e F JP 80
7 c f F JP 90
I want to summarize the score by GP_A, GP_B, or other grouping columns that are not shown in this example. As there might be up to 50 grouping columns, I decided to use a for loop to summarize the score.
The original method summarizes the score one grouping column at a time:
GP_A_SCORE <- df %>% group_by(GP_A,GENDER,LOC) %>% summarize(SCORE=mean(SCORE))
GP_B_SCORE <- df %>% group_by(GP_B,GENDER,LOC) %>% summarize(SCORE=mean(SCORE))
...
What I want is a for loop like this (it does not run):
GP_list <- c("GP_A","GP_B",...)
LOC_list <- c("HK","UK","JP",...)
SCORE <- list()
for (i in GP_list){
for (j in LOC_list){
SCORE[[paste0(i,j)]] <- df %>% group_by(i,j,GENDER) %>% summarize(SCORE=mean(SCORE))
}}
Because the variables passed to group_by() are character strings, this error is shown:
Error: Column I, J is unknown
Is there any way to make R recognize the variable?
I am facing the same problem with left_join from dplyr.
An error is shown when I do something like left_join(x,y,by=c(i=i)) inside a loop.
You could reshape the data to long format and then calculate the mean:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with('GP')) %>%
  group_by(GENDER, LOC, name, value) %>%
  summarise(SCORE = mean(SCORE))
# GENDER LOC name value SCORE
# <fct> <fct> <chr> <fct> <dbl>
# 1 F JP GP_A c 85
# 2 F JP GP_B e 80
# 3 F JP GP_B f 90
# 4 F UK GP_A b 30
# 5 F UK GP_B e 30
# 6 M HK GP_A a 60
# 7 M HK GP_B d 60
# 8 M UK GP_A a 80
# 9 M UK GP_A b 20
#10 M UK GP_B e 50
We can use melt from data.table
library(data.table)
melt(setDT(df), measure = patterns("^GP"))[, .(SCORE = mean(SCORE)),
.(GENDER, LOC, variable, value)]
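data (built with data.frame() rather than as.data.frame(cbind(...)), because cbind() coerces every column, including SCORE, to character)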
df <- data.frame(GP_A,GP_B,GENDER,LOC,SCORE)
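If you do want to keep the loop from the question, one possible sketch is to pass the column name stored in i through across(all_of(...)), which tells group_by() to look up columns by the strings it is given. This assumes dplyr 1.0.0 or later; the list name scores is just a placeholder, and LOC is used as an ordinary grouping column because "HK", "UK" and "JP" are values of LOC, not column names.

```r
library(dplyr)

df <- data.frame(
  GP_A   = c(rep("a", 3), rep("b", 2), rep("c", 2)),
  GP_B   = c(rep("d", 2), rep("e", 4), rep("f", 1)),
  GENDER = c(rep("M", 4), rep("F", 3)),
  LOC    = c(rep("HK", 2), rep("UK", 3), rep("JP", 2)),
  SCORE  = c(50, 70, 80, 20, 30, 80, 90)
)

GP_list <- c("GP_A", "GP_B")  # names of the grouping columns, as strings
scores <- list()              # hypothetical container for the results

for (i in GP_list) {
  scores[[i]] <- df %>%
    # all_of() selects the columns whose names are given as strings
    group_by(across(all_of(c(i, "GENDER", "LOC")))) %>%
    summarise(SCORE = mean(SCORE), .groups = "drop")
}

scores$GP_A
```

For the left_join() case, by also accepts a plain character vector, so by = i joins on the column whose name is stored in i, whereas c(i = i) names the left-hand key literally "i".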
I want to add some specific values from column z of data frame df2 to data frame df1, but only for id = 1 and id = 3.
I have already tried solutions with ifelse, but when ids are missing from df1 they only work for the first value, up to the first gap.
df1$z <- ifelse((df1$id == df2$id), df2$z, 0)
Examples of the data:
df1 <- read.table(text = "
id v w
1 20 B
3 30 T
", h = T)
df2 <- read.table(text = "
id z b c d e f g h i j
1 100 z w e r y w u y q
2 800 t q j n m q i x z
3 700 f e q b a i e p w
4 300 a b c d a g s y q"
, h = T)
Expected result:
df1_add <- read.table(text = "
id v w z
1 20 B 100
3 30 T 700
", h = T)
Let's use left_join() and select() from the dplyr package:
library(dplyr)
df1_add <- df1 %>%
  left_join(df2 %>% select(id, z), by = "id")  # join explicitly on id
df1_add
id v w z
1 1 20 B 100
2 3 30 T 700
You can try this (it assumes the matching rows of df2 appear in the same order as the rows of df1):
df_add <- df1
df_add$z = df2[df2$id %in% c(1, 3), ]$z
We can use merge from base R
merge(df1, df2[c("id", "z")])
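To see why the ifelse attempt breaks, note that df1$id == df2$id compares the two vectors position by position (recycling the shorter one) rather than looking ids up. A small base R sketch of a lookup with match(), using a cut-down df2 with only the id and z columns:

```r
df1 <- data.frame(id = c(1, 3), v = c(20, 30), w = c("B", "T"))
df2 <- data.frame(id = 1:4, z = c(100, 800, 700, 300))

# Position-wise comparison: df1$id is recycled to 1, 3, 1, 3 and compared
# with 1, 2, 3, 4, so only the first position matches.
pos_cmp <- df1$id == df2$id

# match() returns, for each df1$id, the row of df2 holding the same id.
idx <- match(df1$id, df2$id)
df1$z <- df2$z[idx]
df1
```

Ids in df1 that do not occur in df2 would get NA here, which is the same behaviour as the left join.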
I have a dataset with answers from a survey of 17 questions (10 questions use a 5-point scale and 7 questions use a 7-point scale). The current data format gives me 5 or 7 columns for each question (True or False), like a one-hot encoding. I want to convert these columns back to 17 single columns.
To be more specific, the data I have looks like the following
Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 .... Q17.1 Q17.2 ... Q17.5
row1 T F F F F F F F T F
... ...
row2000 F T F F F F F T F F
the desired format I want to have is
Q1 Q2 .... Q17
row1 1 4 2 # with number indicating the value that the column is True
....
row2000 2 3 1 #(e.g., if Q2.4 is T, then for Q2, it is 4).
Base R approach using split.default and max.col. Using split.default we can split the columns based on the pattern in their name, so that every question is divided into a list. Assuming every question would have only one TRUE value we can use max.col to find the TRUE index.
sapply(split.default(df, sub("\\..*", "", names(df))), max.col)
# Q1 Q2
#[1,] 1 2
#[2,] 6 5
data
df <-read.table(text = "Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 Q2.1 Q2.2 Q2.3 Q2.4 Q2.5
T F F F F F F F T F F F
F F F F F T F F F F F T", header = T)
This assumes the class of your data is "logical". If "T"/"F" is stored as character (like in @Maurits's answer), we need to convert the columns to logical first.
Using data from @Maurits Evers
df[] <- lapply(df, as.logical)
sapply(split.default(df, sub("\\..*", "", names(df))), max.col)
# Q1 Q17
#[1,] 1 2
#[2,] 2 1
Here is a tidyverse option:
library(tidyverse)
df %>%
  rownames_to_column("row") %>%
  gather(k, v, -row) %>%
  separate(k, c("question", "part"), sep = "\\.") %>%
  filter(v == "T") %>%
  group_by(row) %>%
  select(-v) %>%
  spread(question, part)
## A tibble: 2 x 3
## Groups: row [2]
# row Q1 Q17
# <chr> <chr> <chr>
#1 row1 1 2
#2 row2000 2 1
I assume that your original data contains "T"/"F" as character entries. If they are in fact TRUE/FALSE, you should change filter(v == "T") to filter(v == TRUE).
Sample data
df <- read.table(text =
"Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 Q17.1 Q17.2 Q17.5
row1 T F F F F F F F T F
row2000 F T F F F F F T F F", colClasses = "character")
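Since gather() and spread() are superseded in current tidyr, here is a sketch of the same idea with pivot_longer() and pivot_wider(). It assumes tidyr 1.0.0 or later and, as above, that "T"/"F" are stored as character; the name res is just a placeholder.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- read.table(text =
"Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 Q17.1 Q17.2 Q17.5
row1 T F F F F F F F T F
row2000 F T F F F F F T F F", colClasses = "character")

res <- df %>%
  rownames_to_column("row") %>%
  # split each column name at the dot into question ("Q1") and part ("1")
  pivot_longer(-row, names_to = c("question", "part"), names_sep = "\\.") %>%
  # keep only the answer that was selected for each question
  filter(value == "T") %>%
  select(-value) %>%
  # one column per question, holding the selected part
  pivot_wider(names_from = question, values_from = part)
res
```

The parts come back as character; wrap them in as.integer() if you need numbers.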
I'm working with a large dataset and doing some calculation with the aggregate() function.
This time I need to group by two different columns and for my calculation I need a user defined function that also uses two columns of the data.frame. That's where I'm stuck.
Here's an example data set:
dat <- data.frame(Kat = c("a","b","c","a","c","b","a","c"),
Sex = c("M","F","F","F","M","M","F","M"),
Val1 = c(1,2,3,4,5,6,7,8)*10,
Val2 = c(2,6,3,3,1,4,7,4))
> dat
Kat Sex Val1 Val2
a M 10 2
b F 20 6
c F 30 3
a F 40 3
c M 50 1
b M 60 4
a F 70 7
c M 80 4
Example of user defined function:
sum(Val1 * Val2) # but grouped by Kat and Sex
I tried this:
aggregate((dat$Val1),
by = list(dat$Kat, dat$Sex),
function(x, y = dat$Val2){sum(x*y)})
Output:
Group.1 Group.2 x
a F 1710
b F 600
c F 900
a M 300
b M 1800
c M 2010
But my expected output would be:
Group.1 Group.2 x
a F 610
b F 120
c F 90
a M 20
b M 240
c M 370
Is there any way to do this with aggregate()?
As @jogo suggested:
aggregate(Val1 * Val2 ~ Kat + Sex, FUN = sum, data = dat)
Or in a tidyverse style
library(dplyr)
dat %>%
  group_by(Kat, Sex) %>%
  summarize(sum(Val1 * Val2))
Or with data.table
library(data.table)
setDT(dat)
dat[ , sum(Val1 * Val2), by = list(Kat, Sex)]
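As for why the attempt in the question gives numbers that are too large: aggregate() hands the function only the Val1 values of one group, while y = dat$Val2 is still the whole column, so the short vector is recycled against all eight Val2 values. A small sketch that reproduces this for the (a, F) group and then computes the product per row first (the helper column prod is just for illustration):

```r
dat <- data.frame(Kat  = c("a", "b", "c", "a", "c", "b", "a", "c"),
                  Sex  = c("M", "F", "F", "F", "M", "M", "F", "M"),
                  Val1 = c(1, 2, 3, 4, 5, 6, 7, 8) * 10,
                  Val2 = c(2, 6, 3, 3, 1, 4, 7, 4))

# What the anonymous function sees for group (a, F): two Val1 values ...
x_aF <- dat$Val1[dat$Kat == "a" & dat$Sex == "F"]
# ... multiplied against all eight Val2 values (x_aF is recycled).
wrong <- sum(x_aF * dat$Val2)

# Multiplying row by row before grouping keeps Val1 and Val2 aligned.
dat$prod <- dat$Val1 * dat$Val2
res <- aggregate(prod ~ Kat + Sex, data = dat, FUN = sum)
res
```

Here wrong is 1710, the value from the question's output, while res holds the expected sums (610 for a/F, and so on).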
I want to combine rows that have different values in one column into a single new row.
An example df looks like this:
df <- data.frame(area = c("a","b","c","a"),
d = c(1,3,6,3),
f = c(3,2,8,2),
e = c(4,7,1,8),
g = c(6,9,2,9))
Here a, b, c are the values of the area column. I want to combine (sum) the rows for a and c into one, to get:
area d f e g
a+c+a 10 13 13 17
b 3 2 7 9
I have tried this:
df <- aggregate(df, list(area=replace(area == c("a","c"), "a+c+a")), sum)
But it doesn't work.
Thank you.
Another solution using dplyr
library(dplyr)
aggr <- df[df$area %in% c("a", "c"), -1] %>%
  summarize_all(sum)
rbind(df[!(df$area %in% c("a", "c")), ],
      bind_cols(area = "a+c+a", aggr))
# area d f e g
# 2 b 3 2 7 9
# 1 a+c+a 10 13 13 17
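The aggregate() attempt from the question can also be made to work by recoding the group label first and then aggregating on the recoded label. A sketch, assuming area is a character column (hence stringsAsFactors = FALSE); grp and res are placeholder names:

```r
df <- data.frame(area = c("a", "b", "c", "a"),
                 d = c(1, 3, 6, 3),
                 f = c(3, 2, 8, 2),
                 e = c(4, 7, 1, 8),
                 g = c(6, 9, 2, 9),
                 stringsAsFactors = FALSE)

# Relabel every "a" or "c" row as "a+c+a"; other areas keep their label.
grp <- ifelse(df$area %in% c("a", "c"), "a+c+a", df$area)

# Sum all numeric columns (everything except area) within each label.
res <- aggregate(df[-1], by = list(area = grp), FUN = sum)
res
```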