Right now I have a dataset that roughly looks like this:
Id Eng_ver_1 Eng_ver_2 Bio_ver_1 Bio_ver_2 Subject Version
1 NA 1 NA NA Eng 2
2 NA NA NA 1 Bio 2
3 NA NA 1 NA Bio 1
4 1 NA NA NA Eng 1
The columns represent conditions that participants go through. Because each person only goes through one condition, it is guaranteed that in every row only 1 of the 4 columns has a value. Instead of looking like this, it is easier to do analysis in my case if the data were to look like this:
Id Subject Version Score
1 English 2 1
2 Biology 2 1
3 Biology 1 1
4 English 1 1
Is there any quick way of doing this transformation? In other words, how do I get rid of all the NAs and collapse the 4 columns into 1?
Additionally, what if instead of 4 columns I have 40 columns, with each Id only having data in 10 of those 40 columns?
Since you'll have data in only one column in each row, I think using rowSums, as suggested by @alistaire, would be an easy and quick solution.
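As a rough sketch of that rowSums idea (the comment itself shows no code), the condition columns can be summed row by row while ignoring NAs. The `_ver_` column pattern and the output name `Score` are assumptions taken from the example in the question. Note that a row with no value at all would get 0 rather than NA.

```r
# Example data from the question
df <- data.frame(
  Id = 1:4,
  Eng_ver_1 = c(NA, NA, NA, 1L),
  Eng_ver_2 = c(1L, NA, NA, NA),
  Bio_ver_1 = c(NA, NA, 1L, NA),
  Bio_ver_2 = c(NA, 1L, NA, NA),
  Subject = c("Eng", "Bio", "Bio", "Eng"),
  Version = c(2L, 2L, 1L, 1L)
)

# Pick the condition columns by name (assumed naming scheme: <subject>_ver_<n>)
cond_cols <- grep("_ver_\\d+$", names(df), value = TRUE)

# Keep the identifying columns and collapse the condition columns into one
out <- df[c("Id", "Subject", "Version")]
out$Score <- rowSums(df[cond_cols], na.rm = TRUE)
out
```

This only works because each row holds exactly one non-NA value; with several values per row the sum would mix them together.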
You can also get the data in long format with pivot_longer from tidyr:
library(dplyr)
df %>%
tidyr::pivot_longer(cols = matches('.*_ver_\\d+'),
values_drop_na = TRUE, values_to = 'score') %>%
select(-name)
# A tibble: 4 x 4
# Id Subject Version score
# <int> <chr> <int> <int>
#1 1 Eng 2 1
#2 2 Bio 2 1
#3 3 Bio 1 1
#4 4 Eng 1 1
data
df <- structure(list(Id = 1:4, Eng_ver_1 = c(NA, NA, NA, 1L), Eng_ver_2 = c(1L,
NA, NA, NA), Bio_ver_1 = c(NA, NA, 1L, NA), Bio_ver_2 = c(NA,
1L, NA, NA), Subject = c("Eng", "Bio", "Bio", "Eng"), Version = c(2L,
2L, 1L, 1L)), class = "data.frame", row.names = c(NA, -4L))
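For the follow-up case where each Id has values in several condition columns, a possible sketch is to let pivot_longer split the column names into subject and version, so that nothing has to be collapsed. The data frame `wide` below is hypothetical (two values per Id instead of ten), and the `names_pattern` assumes the `<subject>_ver_<n>` naming used in the question.

```r
library(tidyr)

# Hypothetical data: each Id has scores in two of the four condition columns
wide <- data.frame(
  Id = 1:2,
  Eng_ver_1 = c(3L, NA),
  Eng_ver_2 = c(NA, 5L),
  Bio_ver_1 = c(NA, 2L),
  Bio_ver_2 = c(4L, NA)
)

long <- pivot_longer(
  wide,
  cols = matches("_ver_\\d+$"),
  names_to = c("Subject", "Version"),            # one new column per regex group
  names_pattern = "(.*)_ver_(\\d+)",
  names_transform = list(Version = as.integer),  # store the version as an integer
  values_to = "Score",
  values_drop_na = TRUE                          # drop the empty conditions
)
long
```

Each Id then contributes one row per condition it actually went through.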
We can use coalesce from dplyr
library(dplyr)
df %>%
transmute(Id, Subject, Version,
score = coalesce(!!! select(., contains('ver'))))
# Id Subject Version score
#1 1 Eng 2 1
#2 2 Bio 2 1
#3 3 Bio 1 1
#4 4 Eng 1 1
data
df <- structure(list(Id = 1:4, Eng_ver_1 = c(NA, NA, NA, 1L), Eng_ver_2 = c(1L,
NA, NA, NA), Bio_ver_1 = c(NA, NA, 1L, NA), Bio_ver_2 = c(NA,
1L, NA, NA), Subject = c("Eng", "Bio", "Bio", "Eng"), Version = c(2L,
2L, 1L, 1L)), class = "data.frame", row.names = c(NA, -4L))
Despite using R and dplyr on a regular basis, I encountered the issue of not being able to calculate the sum of the absolute differences between all columns:
sum_diff=ABS(A-B)+ABS(B-C)+ABS(C-D)...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
We can take the data frame without its last column, subtract the data frame without its first column, and use rowSums on the absolute values, all in base R. This is likely to be very efficient compared to a package solution:
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
-output
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
Data from akrun (many thanks)!
This is a bit complicated. The idea is to generate a list of the column combinations. I tried combn, but it returns all possible combinations rather than only the adjacent pairs, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling afterwards:
library(tidyverse)
combinations <-list(c("A", "B"), c("B", "C"), c("C","D"))
purrr::map_dfc(combinations, ~{df <- tibble(a=data[[.[[1]]]]-data[[.[[2]]]])
names(df) <- paste0(.[[1]],"_v_",.[[2]])
df}) %>%
transmute(sum_diff = rowSums(abs(.))) %>%
bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
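The hand-written list of adjacent pairs could also be built from the column names. The following is only a sketch in base R: it pairs each column name with the next one and then mirrors the map_dfc step with sapply.

```r
data <- data.frame(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
                   C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L))

nms <- names(data)

# Pair every column with its right-hand neighbour: (A,B), (B,C), (C,D)
combinations <- unname(Map(c, head(nms, -1), tail(nms, -1)))

# One column of absolute differences per pair, then sum across the pairs
diffs <- sapply(combinations, function(p) abs(data[[p[1]]] - data[[p[2]]]))
sum_diff <- rowSums(diffs)
sum_diff
```

This scales to any number of columns without editing the list by hand.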
Here is a dplyr version of @akrun's elegant approach, which subtracts a shifted copy of the data frame from itself:
library(dplyr)
df %>%
mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1))
- identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time each row is treated as a vector from which its shifted self is subtracted.
library(dplyr)
library(purrr) # for map2_int
df %>%
rowwise() %>%
mutate(sum_diff = map2_int(c_across(1:last_col(1)),
c_across(2:last_col()),
~ abs(.x - .y)) %>% sum())
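The row-wise idea can also be sketched in base R, where diff() produces the differences between neighbouring elements of each row. This is an alternative illustration, not the code from the answers above.

```r
df1 <- data.frame(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
                  C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L))

# For each row: differences between adjacent columns, absolute value, sum
sum_diff <- apply(as.matrix(df1), 1, function(r) sum(abs(diff(r))))
sum_diff
```

For large data frames the vectorised rowSums approach is likely to be faster, because apply calls the function once per row.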
X Y
1 1 2
2 2 4
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 1 4
8 2 6
9 1 8
10 1 10
The expected result: in the first block, the average of the Y values 2 and 4 is 3; in the second block, the average of the values 4, 6, 8 and 10 is 7; and so on.
Your data:
df = data.frame(X=c(1,2,NA,NA,NA,NA,1,2,1,1),Y=c(2,4,NA,NA,NA,NA,4,6,8,10))
You can identify blocks of consecutive rows with no NAs using diff(complete.cases(...)):
blocks = cumsum(c(0,diff(complete.cases(df)) != 0 ))
block_means = tapply(df$Y,blocks,mean)
0 1 2
3 NA 7
block_means[!is.na(block_means)]
0 2
3 7
Or, if you don't need to know which block each mean came from:
na.omit(as.numeric(tapply(df$Y,blocks,mean)))
[1] 3 7
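To make the block construction easier to follow, here is the same idea split into intermediate steps. The names `ok` and `change` are introduced only for this sketch; filtering on `ok` before calling tapply is a small variation that avoids computing a mean for the all-NA block.

```r
df <- data.frame(X = c(1, 2, NA, NA, NA, NA, 1, 2, 1, 1),
                 Y = c(2, 4, NA, NA, NA, NA, 4, 6, 8, 10))

ok <- complete.cases(df)        # TRUE for rows without any NA
change <- diff(ok) != 0         # TRUE wherever a run of complete/incomplete rows ends
blocks <- cumsum(c(0, change))  # block id per row: 0 0 1 1 1 1 2 2 2 2

# Average Y within each block, using only the complete rows
block_means <- tapply(df$Y[ok], blocks[ok], mean)
block_means
```

The names of `block_means` are the ids of the blocks that contain data ("0" and "2" here), with means 3 and 7.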
We can create groups of consecutive values using rleid from data.table and, within each group, calculate the mean of the Y values:
library(dplyr)
df %>%
group_by(gr = data.table::rleid(is.na(Y))) %>%
summarise(Y = mean(Y, na.rm = TRUE)) %>%
filter(!is.na(Y)) -> df1
df1
# gr Y
# <int> <dbl>
#1 1 3
#2 3 7
The data.table way of doing this would be:
library(data.table)
df1 <- setDT(df)[, .(Y = mean(Y, na.rm = TRUE)), rleid(is.na(Y))][!is.na(Y)]
data
df <- structure(list(X = c(1L, 2L, NA, NA, NA, NA, 1L, 2L, 1L, 1L),
Y = c(2L, 4L, NA, NA, NA, NA, 4L, 6L, 8L, 10L)),
class = "data.frame", row.names = c(NA, -10L))
I have a data frame that looks like this
column1
1
1
2
3
3
and I would like to give a unique ID to each distinct value. My problem is that I cannot
find a way to make the unique IDs start from zero, like this:
column1 column2
1 0
1 0
2 1
3 2
3 2
Any help is appreciated
Try this: cur_group_id from dplyr numbers the groups starting from 1, but you can easily make the IDs start from zero by subtracting 1:
library(dplyr)
#Data
df <- structure(list(column1 = c(0L, 1L, 2L, 3L, 3L)), class = "data.frame", row.names = c(NA,-5L))
#Mutate
df %>% group_by(column1) %>% mutate(id=cur_group_id()-1)
# A tibble: 5 x 2
# Groups: column1 [4]
column1 id
<int> <dbl>
1 0 0
2 1 1
3 2 2
4 3 3
5 3 3
We could use match
library(dplyr)
df1 %>%
mutate(column2 = match(column1, unique(column1)) - 1)
data
df1 <- structure(list(column1 = c(1L, 1L, 2L, 3L, 3L)), class = "data.frame",
row.names = c(NA,
-5L))
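The two approaches agree on sorted input like the example, but they can differ when the values are not sorted: match with unique numbers values in order of first appearance, whereas grouping (as far as I know) numbers groups by sorted value. The sketch below uses a hypothetical unsorted vector and base R's factor to imitate the sorted numbering.

```r
x <- c(3L, 1L, 3L, 2L)  # hypothetical unsorted input

# IDs in order of first appearance: 3 -> 0, 1 -> 1, 2 -> 2
id_first_seen <- match(x, unique(x)) - 1L

# IDs in sorted order of the values: 1 -> 0, 2 -> 1, 3 -> 2
id_sorted <- as.integer(factor(x)) - 1L

data.frame(x, id_first_seen, id_sorted)
```

Which numbering is appropriate depends on whether the IDs should follow the data order or the value order.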
I have the following dataframe in R:
Date A B C
1 2015-01-17 1 NA 1
2 2015-01-18 NA NA NA
3 2015-01-19 1 2 3
4 2015-01-19 1 NA 1
...
The goal is to combine rows that share the same date by summing their values in columns A, B and C:
Date A B C
1 2015-01-17 1 NA 1
2 2015-01-18 NA NA NA
3 2015-01-19 2 2 4
...
Thank you for your help.
library(dplyr)
df %>%
group_by(Date)%>%
summarise_at(c("A", "B", "C"), function(x) if (any(!is.na(x))) sum(x, na.rm = TRUE) else NA)
# A tibble: 3 x 4
Date A B C
<fct> <int> <int> <int>
1 2015-01-17 1 NA 1
2 2015-01-18 NA NA NA
3 2015-01-19 2 2 4
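With dplyr 1.0.0 or later, the same logic can be sketched with across(). The helper name `sum_or_na` is made up for this example, and NA_integer_ assumes the summed columns are integers as in the sample data.

```r
library(dplyr)

df <- data.frame(
  Date = c("2015-01-17", "2015-01-18", "2015-01-19", "2015-01-19"),
  A = c(1L, NA, 1L, 1L),
  B = c(NA, NA, 2L, NA),
  C = c(1L, NA, 3L, 1L)
)

# Sum while ignoring NAs, but keep NA when a group has no values at all
sum_or_na <- function(x) if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)

res <- df %>%
  group_by(Date) %>%
  summarise(across(c(A, B, C), sum_or_na))
res
```

A plain sum(x, na.rm = TRUE) would return 0 for the all-NA date, which is why the helper checks for that case first.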
data:
df <- structure(list(Date = structure(c(1L, 2L, 3L, 3L), .Label = c("2015-01-17",
"2015-01-18", "2015-01-19"), class = "factor"), A = c(1L, NA,
1L, 1L), B = c(NA, NA, 2L, NA), C = c(1L, NA, 3L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Another option is sum_ from hablar
library(hablar)
library(dplyr)
df %>%
group_by(Date) %>%
summarise_if(is.numeric, sum_)
# A tibble: 3 x 4
# Date A B C
# <fct> <int> <int> <int>
#1 2015-01-17 1 NA 1
#2 2015-01-18 NA NA NA
#3 2015-01-19 2 2 4
data
df <- structure(list(Date = structure(c(1L, 2L, 3L, 3L), .Label = c("2015-01-17",
"2015-01-18", "2015-01-19"), class = "factor"), A = c(1L, NA,
1L, 1L), B = c(NA, NA, 2L, NA), C = c(1L, NA, 3L, 1L)),
class = "data.frame", row.names = c("1",
"2", "3", "4"))
I am relatively new to R but slowly finding my way. I encountered a problem, however, and hope someone can help me.
Let's say I have two data frames (let's call them A and B), both containing survey responses. A contains all responses from the first set of people. B contains the responses of the second set of people, plus the people of the first set but with their responses set to NA. An example:
Dataframe A:
Household Individual Answer_A Answer_b
1 2 5 6
1 3 6 6
2 1 2 3
Dataframe B:
Household Individual Answer_A Answer_b
1 1 3 6
1 2 NA NA
1 3 NA NA
2 1 NA NA
2 2 4 7
I want to get one dataframe with all individuals and their responses:
Dataframe C:
Household Individual Answer_A Answer_b
1 1 3 6
1 2 5 6
1 3 6 6
2 1 2 3
2 2 4 7
If I only have two datasets I can use rbind.fill, with rbind.fill(B, A) to get dataframe C, as then the NAs in B are overwritten with answers in A.
But if I had to add a third dataset, D, that consists of NAs for the people in A and B, I would not be able to use this solution. What could I do then? I've looked at intersect, outersect, and different forms of join, but can't seem to think of a good solution.
Any thoughts?
Maybe you can left_join and then use coalesce
library(dplyr)
left_join(B, A, by = c("Household", "Individual")) %>%
mutate(Answer_A = coalesce(Answer_A.x, Answer_A.y),
Answer_B = coalesce(Answer_b.x, Answer_b.y)) %>%
select(-matches("\\.x|\\.y"))
# Household Individual Answer_A Answer_B
#1 1 1 3 6
#2 1 2 5 6
#3 1 3 6 6
#4 2 1 2 3
#5 2 2 4 7
data
A <- structure(list(Household = c(1L, 1L, 2L), Individual = c(2L,
3L, 1L), Answer_A = c(5L, 6L, 2L), Answer_b = c(6L, 6L, 3L)), class = "data.frame",
row.names = c(NA, -3L))
B <- structure(list(Household = c(1L, 1L, 1L, 2L, 2L), Individual = c(1L,
2L, 3L, 1L, 2L), Answer_A = c(3L, NA, NA, NA, 4L), Answer_b = c(6L,
NA, NA, NA, 7L)), class = "data.frame", row.names = c(NA, -5L))
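For the case with a third dataset, one possible sketch is to stack all data frames and keep the first non-NA answer per person. The data frame D below is hypothetical (it adds one new household), and the approach assumes each individual has a real answer in at most one of the data frames.

```r
library(dplyr)

A <- data.frame(Household = c(1L, 1L, 2L), Individual = c(2L, 3L, 1L),
                Answer_A = c(5L, 6L, 2L), Answer_b = c(6L, 6L, 3L))
B <- data.frame(Household = c(1L, 1L, 1L, 2L, 2L), Individual = c(1L, 2L, 3L, 1L, 2L),
                Answer_A = c(3L, NA, NA, NA, 4L), Answer_b = c(6L, NA, NA, NA, 7L))
# Hypothetical third data set: NAs for everyone in A and B, plus one new person
D <- data.frame(Household = c(1L, 1L, 1L, 2L, 2L, 3L), Individual = c(1L, 2L, 3L, 1L, 2L, 1L),
                Answer_A = c(NA, NA, NA, NA, NA, 1L), Answer_b = c(NA, NA, NA, NA, NA, 2L))

C <- bind_rows(A, B, D) %>%
  group_by(Household, Individual) %>%
  # first non-NA value per answer column (NA if the person never answered)
  summarise(across(everything(), ~ .x[!is.na(.x)][1]), .groups = "drop")
C
```

Because the data frames are simply stacked, the same code works for any number of datasets without chaining joins.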