I want to create a new variable, "F", by adding columns (B+C+D+E) if the column "A" is 1.
ID   A  B  C  D   E
001  1  1  2  NA  1
002  0  2  1  1   NA
df$F <- rowSums(df[df$A == '1', c(3:6)],na.rm=TRUE)
I get this error:
Error:
! Assigned data `rowSums(df[df$A == "1", c(3:6)], na.rm = TRUE)` must be compatible with existing data.
✖ Existing data has 12358 rows.
✖ Assigned data has 474 rows.
ℹ Only vectors of size 1 are recycled.
Backtrace:
1. base::`$<-`(`*tmp*`, F, value = `<dbl>`)
12. tibble (local) `<fn>`(`<vctrs___>`)
Error:
How can I fix this? Is there another way to get a result that looks like the one below?
ID   A  B  C  D   E   F
001  1  1  2  NA  1   4
002  0  2  1  1   NA  NA
Try this.
df$F <- ifelse(df$A == 1, rowSums(df[, c("B", "C", "D", "E")], na.rm=TRUE), NA)
df
# ID A B C D E F
# 1 1 1 1 2 NA 1 4
# 2 2 0 2 1 1 NA NA
The logical index also needs to appear on the left-hand side, so that both sides of the assignment have the same length:
df$F[df$A == '1'] <- rowSums(df[df$A == '1', c(3:6)],na.rm=TRUE)
-output
> df
ID A B C D E F
1 1 1 1 2 NA 1 4
2 2 0 2 1 1 NA NA
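To see why the original assignment fails, it can help to compare the lengths directly. The sketch below rebuilds the two-row example as a plain data.frame (an assumption; the asker's df is a tibble with 12358 rows) and selects columns by name rather than by position. The variable names keep and sums are only for illustration.

```r
# Two-row example data, rebuilt by hand for illustration
df <- data.frame(
  ID = c("001", "002"),
  A  = c(1, 0),
  B  = c(1, 2),
  C  = c(2, 1),
  D  = c(NA, 1),
  E  = c(1, NA),
  stringsAsFactors = FALSE
)

keep <- df$A == 1                                   # logical index of rows to fill
sums <- rowSums(df[keep, c("B", "C", "D", "E")], na.rm = TRUE)

length(sums)   # one value per row where A == 1
nrow(df)       # more rows than values, hence the size error

df$F <- NA_real_      # start with NA everywhere
df$F[keep] <- sums    # fill only the matching rows
df
```

Indexing both sides with the same logical vector guarantees that the number of values assigned equals the number of rows being filled.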
A tidyverse approach:
Libraries
library(dplyr)
Data
data <-
tibble::tribble(
~ID, ~A, ~B, ~C, ~D, ~E,
"001", 1L, 1L, 2L, NA, 1L,
"002", 0L, 2L, 1L, 1L, NA
)
Code
data %>%
rowwise() %>%
mutate(`F` = if_else(A == 1, sum(c_across(cols = B:E),na.rm = TRUE), NA_integer_) )
Output
# A tibble: 2 x 7
# Rowwise:
ID A B C D E F
<chr> <int> <int> <int> <int> <int> <int>
1 001 1 1 2 NA 1 4
2 002 0 2 1 1 NA NA
R - Count unique/distinct values in two columns together
Hi everyone. I have a panel of electoral behaviour, but I am having trouble computing a new variable that captures the number of unique values (parties) across my two columns, Party and Party2013, per group. The column Party2013 records the vote in the 2013 election and Party records voters' intentions after 2013. Every time I try n_distinct or length, I get the count of unique values in each column separately, not the combined count.
ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
For the example above I get a count of 3 instead of the desired 2.
I've tried the following commands, but they only give the number of separate unique values:
data %>% group_by(ID) %>% distinct(Party, Party2013, .keep_all = TRUE) %>% dplyr::summarise(Party_Party2013 = n())
or
ddply(data, .(ID), mutate, count = length(unique(Party, Party2013)))
The expected outcome would be as follows:
ID Wave Party Party2013 Count
1 1 A A 2
1 2 A NA 2
1 3 B NA 2
1 4 B NA 2
2 1 A C 3
2 2 B NA 3
2 3 B NA 3
2 4 B NA 3
I would very much appreciate any advice on how to count the overall number of unique parties across the two columns per group and not the number of distinct values per each one. Thanks.
You can subset the data from cur_data() and unlist it to get a vector, then use n_distinct to count the number of unique values. (In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick().)
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
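For comparison, the same "pool both columns, then count" idea can be sketched in base R without dplyr. This is only an illustration under the assumption that the data has the shape shown above; the name counts is made up for the example.

```r
# Example data in the same shape as above
df <- data.frame(
  ID        = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Wave      = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  Party     = c("A", "A", "B", "B", "A", "B", "B", "B"),
  Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA),
  stringsAsFactors = FALSE
)

# For each ID, pool both party columns into one vector,
# drop the NAs, and count the distinct values that remain
counts <- sapply(
  split(df[c("Party", "Party2013")], df$ID),
  function(g) length(unique(na.omit(unlist(g))))
)

# Map the per-group count back onto every row of that group
df$Count <- unname(counts[as.character(df$ID)])
df
```

The key step is unlist(), which turns the two columns into a single vector so that a party appearing in both columns is counted once.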
In situations like this I always like to simplify the problem and change the data into the long format since it is easier to solve problems like this if all of your values are in one column. With pivot_longer() you can also use the argument values_drop_na = TRUE to drop NAs which were counted in your example:
library(tidyr)
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>% pivot_longer(cols = starts_with("Party"), values_drop_na = TRUE) %>% group_by(ID) %>%
summarise(Count = n_distinct(value)) %>% merge(data, .)
#> ID Wave Party Party2013 Count
#> 1 1 1 A A 2
#> 2 1 2 A <NA> 2
#> 3 1 3 B <NA> 2
#> 4 1 4 B <NA> 2
#> 5 2 1 A C 3
#> 6 2 2 B <NA> 3
#> 7 2 3 B <NA> 3
#> 8 2 4 B <NA> 3
Created on 2021-08-30 by the reprex package (v2.0.1)
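You can also do it this way. Note that paste() combines the two columns row by row, so this counts distinct (Party, Party2013) pairs rather than distinct parties; the counts in the output below (3 and 2) therefore differ from the expected 2 and 3: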
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>%
group_by(ID) %>%
mutate(Count = paste(Party, Party2013) %>%
unique %>% length() %>%
rep(length(Party)))
output
# A tibble: 8 x 5
# Groups: ID [2]
ID Wave Party Party2013 Count
<int> <int> <chr> <chr> <int>
1 1 1 A A 3
2 1 2 A NA 3
3 1 3 B NA 3
4 1 4 B NA 3
5 2 1 A C 2
6 2 2 B NA 2
7 2 3 B NA 2
8 2 4 B NA 2
I am trying to remove duplicates from a dataset (caused by merging). In some cases one of the duplicated rows contains a value and the other does not; in other cases both rows are NA. I want to keep the rows with data, and if there are only NAs, it does not matter which one I keep. How do I do that? I am stuck.
I tried the solutions from the question below without success (I don't usually work with data.table, so I don't fully understand them):
R data.table remove rows where one column is duplicated if another column is NA
Some minimum example data:
df <- data.frame(ID = c("A", "A", "B", "B", "C", "D", "E", "G", "H", "J", "J"),
value = c(NA, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L))
ID value
A NA
A 1
B NA
B NA
C 1
D 1
E 1
G 1
H 1
J NA
J 1
and I want this:
ID value
A 1
B NA
C 1
D 1
E 1
G 1
H 1
J 1
One possibility using dplyr could be:
df %>%
group_by(ID) %>%
slice(which.max(!is.na(value)))
ID value
<chr> <int>
1 A 1
2 B NA
3 C 1
4 D 1
5 E 1
6 G 1
7 H 1
8 J 1
An alternative to @tmfmnk's answer, using slice_max() from dplyr:
library(dplyr)
df %>%
group_by(ID) %>%
slice_max(!is.na(value), with_ties = FALSE)
# # A tibble: 8 x 2
# # Groups: ID [8]
# ID value
# <chr> <int>
# 1 A 1
# 2 B NA
# 3 C 1
# 4 D 1
# 5 E 1
# 6 G 1
# 7 H 1
# 8 J 1
Here is a relatively simple data.table solution.
Grouping by ID: if all the values in a group are NA, take just the first value; otherwise take all values that are not NA.
library(data.table)
setDT(df)
df[, if (all(is.na(value))) value[1] else value[!is.na(value)], by = ID]
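The same result can also be sketched in base R by sorting so that non-NA values come first within each ID and then keeping the first row per ID. This is a minimal sketch on the example data from the question, not a replacement for the answers above; the names ord, sorted and result are illustrative.

```r
df <- data.frame(
  ID    = c("A", "A", "B", "B", "C", "D", "E", "G", "H", "J", "J"),
  value = c(NA, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L),
  stringsAsFactors = FALSE
)

# Sort by ID, and within each ID put non-NA values (is.na == FALSE) first
ord    <- order(df$ID, is.na(df$value))
sorted <- df[ord, ]

# Keep only the first row of each ID; that row has a value whenever one exists
result <- sorted[!duplicated(sorted$ID), ]
rownames(result) <- NULL
result
```

Unlike the data.table version, this always keeps exactly one row per ID, even when an ID has several non-NA rows.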
My Data Frame looks something like the first three columns of this example:
id obs value newCol
a 1 uncool NA
a 2 cool 1
a 3 uncool NA
a 4 uncool NA
a 5 cool 2
a 6 uncool NA
a 7 cool 1
a 8 uncool NA
b 1 cool 0
What I need is a column (newCol above) that counts the number of "uncool"s between the observations with value "cool" or the first row of the group (grouped by id).
How do I do that (by using dplyr ideally)?
We can define groups by doing a cumsum starting from the bottom, then use ave to build a vector for each group :
transform(dat, newCol = ave(
value, id, rev(cumsum(rev(value=="cool"))),
FUN = function(x) ifelse(x=="cool", length(x)-1, NA)))
# id obs value newCol
# 1 a 1 uncool <NA>
# 2 a 2 cool 1
# 3 a 3 uncool <NA>
# 4 a 4 uncool <NA>
# 5 a 5 cool 2
# 6 a 6 uncool <NA>
# 7 a 7 cool 1
# 8 a 8 uncool <NA>
# 9 b 1 cool 0
With dplyr :
dat %>%
group_by(id,temp = rev(cumsum(rev(value=="cool")))) %>%
mutate(newCol = ifelse(value=="cool", n()-1, NA)) %>%
ungroup() %>%
select(-temp)
# # A tibble: 9 x 4
# id obs value newCol
# <chr> <int> <chr> <dbl>
# 1 a 1 uncool NA
# 2 a 2 cool 1
# 3 a 3 uncool NA
# 4 a 4 uncool NA
# 5 a 5 cool 2
# 6 a 6 uncool NA
# 7 a 7 cool 1
# 8 a 8 uncool NA
# 9 b 1 cool 0
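The grouping trick is easier to follow when the intermediate vector is printed. The sketch below works on the values of id "a" only (an assumption made to keep it short) and shows that the reversed cumsum gives every "cool" the same group number as the "uncool" rows immediately before it.

```r
# Values for id "a" from the example
value <- c("uncool", "cool", "uncool", "uncool", "cool", "uncool", "cool", "uncool")

# Cumulative count of "cool" taken from the bottom up:
# each "cool" is the last row of its own group
grp <- rev(cumsum(rev(value == "cool")))
grp

# Group size minus one is the number of "uncool" rows before each "cool"
size   <- ave(seq_along(value), grp, FUN = length)
newCol <- ifelse(value == "cool", size - 1L, NA)
newCol
```

Because each group ends with exactly one "cool", the group size minus one equals the number of "uncool" rows that precede it.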
Besides id you need another grouping variable, given by grp = cumsum(dat$value == "cool") - (dat$value == "cool") which is shown below.
Then you can use mutate where we assign sum(value == "uncool") to observations where value == "cool" and NA otherwise within each group.
library(dplyr)
dat %>%
group_by(id, grp = cumsum(dat$value == "cool") - (dat$value == "cool")) %>%
mutate(newCool = if_else(value == "cool", sum(value == "uncool"), NA_integer_))
# A tibble: 9 x 6
# Groups: id, grp [5]
id obs value newCol grp newCool
<chr> <int> <chr> <int> <int> <int>
1 a 1 uncool NA 0 NA
2 a 2 cool 1 0 1
3 a 3 uncool NA 1 NA
4 a 4 uncool NA 1 NA
5 a 5 cool 2 1 2
6 a 6 uncool NA 2 NA
7 a 7 cool 1 2 1
8 a 8 uncool NA 3 NA
9 b 1 cool 0 3 0
data
dat <- structure(list(id = c("a", "a", "a", "a", "a", "a", "a", "a",
"b"), obs = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L), value = c("uncool",
"cool", "uncool", "uncool", "cool", "uncool", "cool", "uncool",
"cool"), newCol = c(NA, 1L, NA, NA, 2L, NA, 1L, NA, 0L)), .Names = c("id",
"obs", "value", "newCol"), class = "data.frame", row.names = c(NA,
-9L))
We can create a helper function that will group value based on cool/uncool, and count the cools, i.e.
library(tidyverse)
f1 <- function(x) {
i1 <- which(x == 'cool')
v1 <- rep(seq_along(i1), c(i1[1], diff(i1)))
if (tail(x, 1) != 'cool') {
return(c(v1, tail(v1, 1) + 1))
} else {
return(v1)
}
}
df %>%
group_by(id) %>%
mutate(new_grp = f1(value)) %>%
group_by(id, new_grp) %>%
mutate(new = length(value[value != 'cool']),
new = replace(new, value != 'cool', NA)) %>%
ungroup() %>%
select(-new_grp)
which gives,
# A tibble: 9 x 5
id obs value newCol new
<fct> <int> <fct> <int> <int>
1 a 1 uncool NA NA
2 a 2 cool 1 1
3 a 3 uncool NA NA
4 a 4 uncool NA NA
5 a 5 cool 2 2
6 a 6 uncool NA NA
7 a 7 cool 1 1
8 a 8 uncool NA NA
9 b 1 cool 0 0
A simple function can also solve the problem:
# Your data
data <- data.frame(id = c("a", "a", "a", "a", "a", "a" ,"a" ,"a", "b"),
obs = c(1,2,3,4,5,6,7,8,1),
value = c("uncool", "cool", "uncool", "uncool", "cool", "uncool" ,"cool" ,"uncool", "cool"),
stringsAsFactors = FALSE)
# Function for solving problem
cool_counter <- function(vector) {
uncool <- FALSE
count <- 0
results <- list()
for(i in 1:length(vector)) {
if(i == 1) {
uncool <- vector[i] == "uncool"
results[[i]] <- NA
if(uncool) {
count <- 1
}
}
if(i > 1) {
uncool <- vector[i] == "uncool"
if(uncool) {
count <- count + 1
results[[i]] <- NA
}
if(!uncool) {
results[[i]] <- count
count <- 0
}
}
}
return(unlist(results))
}
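This gives the following. Note that the function always returns NA for the first row of a group, so group b shows NA rather than the 0 in the expected output: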
# Running function
library(dplyr)
data <- data %>%
group_by(id) %>%
mutate(newCol = cool_counter(value))
# Results
data
id obs value newCol
<chr> <dbl> <chr> <dbl>
1 a 1 uncool NA
2 a 2 cool 1
3 a 3 uncool NA
4 a 4 uncool NA
5 a 5 cool 2
6 a 6 uncool NA
7 a 7 cool 1
8 a 8 uncool NA
9 b 1 cool NA
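If the first row of a group should yield 0 when it is "cool", a vectorised variant is one possible alternative to the loop. The function name count_uncool_before_cool is hypothetical, and the sketch assumes the input is a character vector for a single id.

```r
# Hypothetical helper: number of "uncool" rows preceding each "cool"
count_uncool_before_cool <- function(x) {
  is_cool <- x == "cool"
  # A new run starts on the row after each "cool"
  run <- cumsum(c(0L, head(is_cool, -1)))
  # Count the "uncool" rows within each run
  n_uncool <- ave(as.integer(!is_cool), run, FUN = sum)
  # Report the count on "cool" rows only
  ifelse(is_cool, n_uncool, NA_integer_)
}

a <- c("uncool", "cool", "uncool", "uncool", "cool", "uncool", "cool", "uncool")
res_a <- count_uncool_before_cool(a)
res_b <- count_uncool_before_cool("cool")
res_a
res_b
```

It can be used inside group_by(id) %>% mutate(...) in the same way as cool_counter.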
I am currently trying to count how often each row of one data frame occurs in another data frame.
A B
1 a
1 b
1 c
2 a
2 b
2 c
I have this data frame, and I would like to count how often each of its rows appears in another data frame, which looks like this:
C D
1 a
1 a
1 b
1 b
2 b
2 c
2 c
As you can see, the number of rows is different, so datatable(counts) does not work. I would like it to look like this after the frequency count is done:
a b freq
1 a 2
1 b 2
1 c 0
2 a 0
2 b 1
2 c 2
As you can see, every combination is counted, including those with a frequency of 0 because some groups have no matching data.
Thanks to anyone who helps!
By using merge and aggregate
df2$freq = 1
df = merge(df1,aggregate(freq~.,df2,length),by.x = c('A','B'),by.y = c('C','D'),all.x = T)
df[is.na(df)] = 0
df
A B freq
1 1 a 2
2 1 b 2
3 1 c 0
4 2 a 0
5 2 b 1
6 2 c 2
More Info
aggregate(freq~.,df2,length)
C D freq
1 1 a 2
2 1 b 2
3 2 b 1
4 2 c 2
Data Input
df1
A B
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
df2
C D
1 1 a
2 1 a
3 1 b
4 1 b
5 2 b
6 2 c
7 2 c
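The zero counts can also be kept without a merge, by tabulating df2 against the full set of combinations in df1. The sketch below is one way to do that in base R; it assumes df1 has no duplicate rows, and the names key1 and key2 are illustrative.

```r
df1 <- data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                  B = c("a", "b", "c", "a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
                  D = c("a", "a", "b", "b", "b", "c", "c"),
                  stringsAsFactors = FALSE)

# One string key per row in each data frame
key1 <- paste(df1$A, df1$B)
key2 <- paste(df2$C, df2$D)

# Using key1 as the factor levels makes table() report every
# combination from df1, including those that never occur in df2
df1$freq <- as.integer(table(factor(key2, levels = key1)))
df1
```

Setting the factor levels explicitly is what prevents table() from dropping the combinations with no matches.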
This looks to be a question of how to tabulate frequencies across two factors without dropping missing levels.
Here's the dplyr solution. This assumes that dfAB, as in your example data, contains no duplicates (dfAB is interchangeable with the output of expand.grid if you don't already have the level combinations in a data frame)
library(dplyr)
dfAB %>%
# need at least one non-joining variable to tell matches from non-matches
left_join(mutate(dfCD, dummy = 1), by = c("A" = "C", "B" = "D")) %>%
group_by(A, B) %>%
summarize(freq = sum(dummy, na.rm = TRUE))
Output:
# A tibble: 6 x 3
# Groups: A [?]
A B freq
<dbl> <chr> <dbl>
1 1 a 2
2 1 b 2
3 1 c 0
4 2 a 0
5 2 b 1
6 2 c 2
(if there are duplicates in dfAB, add a distinct call to the chain before the join)
df1_rows = Reduce(paste, df1)
df2_rows = Reduce(paste, df2)
data.frame(df1, freq = sapply(df1_rows, function(x) sum(df2_rows %in% x)),
row.names = NULL)
# A B freq
#1 1 a 2
#2 1 b 2
#3 1 c 0
#4 2 a 0
#5 2 b 1
#6 2 c 2
DATA
df1 = data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
B = c("a", "b", "c", "a", "b", "c"))
df2 = data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
D = c("a", "a", "b", "b", "b", "c", "c"))
I have a dataframe with a lot of variables seen in multiple conditions. I'd like to merge each variable by condition.
The example data frame is a simplified version of what I have (3 variables over 2 conditions).
VAR.B_1 <- c(1, 2, 3, 4, 5, 'NA', 'NA', 'NA', 'NA', 'NA')
VAR.B_2 <- c(2, 2, 3, 4, 5,'NA', 'NA', 'NA', 'NA', 'NA')
VAR.B_3 <- c(1, 1, 1, 1, 1,'NA', 'NA', 'NA', 'NA', 'NA')
VAR.E_1 <- c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1)
VAR.E_2 <- c(NA, NA, NA, NA, NA, 1, 2, 3, 4, 5)
VAR.E_3 <- c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1)
Condition <- c("B", "B","B","B","B","E","E","E","E","E")
#Example dataset
data<-as.data.frame(cbind(VAR.B_1,VAR.B_2,VAR.B_3, VAR.E_1,VAR.E_2, VAR.E_3, Condition))
I want to end up with this, appended to the original data frame:
VAR_1 VAR_2 VAR_3
1 2 1
2 2 1
3 3 1
4 4 1
5 5 1
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
I understand that R won't work with i inside the variable name, but I have an example of the kind of for loop I was trying to do. I would rather not call variables by column location, since there will be a lot of variables.
##Example of how I want to merge - this code does not work
for(i in 1:3) {
data$VAR_[,i] <-ifelse(data$Condition == "B", VAR.B_[,i],
ifelse(data$Condition == "E", VAR.E_[,i], NA))
}
This might work for your situation:
library(tidyverse)
library(stringr)
data %>%
mutate_all(as.character) %>%
gather(key, value, -Condition) %>%
filter(!is.na(value), value != "NA") %>%
mutate(key = str_replace(key, paste0("\\.", Condition), "")) %>%
group_by(Condition, key) %>%
mutate(rowid = 1:n()) %>%
spread(key, value) %>%
bind_cols(data)
#> # A tibble: 10 x 12
#> # Groups: Condition [2]
#> Condition rowid VAR_1 VAR_2 VAR_3 VAR.B_1 VAR.B_2 VAR.B_3 VAR.E_1
#> <chr> <int> <chr> <chr> <chr> <fctr> <fctr> <fctr> <fctr>
#> 1 B 1 1 2 1 1 2 1 NA
#> 2 B 2 2 2 1 2 2 1 NA
#> 3 B 3 3 3 1 3 3 1 NA
#> 4 B 4 4 4 1 4 4 1 NA
#> 5 B 5 5 5 1 5 5 1 NA
#> 6 E 1 1 1 1 NA NA NA 1
#> 7 E 2 1 2 1 NA NA NA 1
#> 8 E 3 1 3 1 NA NA NA 1
#> 9 E 4 1 4 1 NA NA NA 1
#> 10 E 5 1 5 1 NA NA NA 1
#> # ... with 3 more variables: VAR.E_2 <fctr>, VAR.E_3 <fctr>,
#> # Condition1 <fctr>
data.frame(lapply(split.default(data[-NCOL(data)], gsub("\\D+", "", head(names(data), -1))),
function(a){
a = sapply(a, function(x) as.numeric(as.character(x)))
rowSums(a, na.rm = TRUE)
}))
# X1 X2 X3
#1 1 2 1
#2 2 2 1
#3 3 3 1
#4 4 4 1
#5 5 5 1
#6 1 1 1
#7 1 2 1
#8 1 3 1
#9 1 4 1
#10 1 5 1
#Warning messages:
#1: In FUN(X[[i]], ...) : NAs introduced by coercion
#2: In FUN(X[[i]], ...) : NAs introduced by coercion
#3: In FUN(X[[i]], ...) : NAs introduced by coercion
Your data appears to have two kinds of NA values in it. It has NA, or R's NA value, and it also has the string 'NA'. In my solution below, I replace both with zero, cast each column in the data frame to numeric, and then just sum together like-numbered VAR columns. Then, drop the original columns which you don't want anymore.
data <- as.data.frame(cbind(VAR.B_1,VAR.B_2,VAR.B_3, VAR.E_1,VAR.E_2, VAR.E_3),
stringsAsFactors=FALSE)
data[is.na(data)] <- 0
data[data == 'NA'] <- 0
data <- as.data.frame(lapply(data, as.numeric))
data$VAR_1 <- data$VAR.B_1 + data$VAR.E_1
data$VAR_2 <- data$VAR.B_2 + data$VAR.E_2
data$VAR_3 <- data$VAR.B_3 + data$VAR.E_3
data <- data[c("VAR_1", "VAR_2", "VAR_3")]
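The loop from the question can be made to work by building the column names as strings and indexing with [[ ]]. This is a sketch only: it assumes the 'NA' strings are entered as real NA values so the columns stay numeric, and it keeps the original columns so the new ones are appended.

```r
data <- data.frame(
  VAR.B_1 = c(1, 2, 3, 4, 5, NA, NA, NA, NA, NA),
  VAR.B_2 = c(2, 2, 3, 4, 5, NA, NA, NA, NA, NA),
  VAR.B_3 = c(1, 1, 1, 1, 1, NA, NA, NA, NA, NA),
  VAR.E_1 = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1),
  VAR.E_2 = c(NA, NA, NA, NA, NA, 1, 2, 3, 4, 5),
  VAR.E_3 = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1),
  Condition = c("B", "B", "B", "B", "B", "E", "E", "E", "E", "E"),
  stringsAsFactors = FALSE
)

for (i in 1:3) {
  # Build the column names for this variable number
  b <- data[[paste0("VAR.B_", i)]]
  e <- data[[paste0("VAR.E_", i)]]
  # Pick the value from the column that matches the row's condition
  data[[paste0("VAR_", i)]] <- ifelse(data$Condition == "B", b,
                                ifelse(data$Condition == "E", e, NA))
}
data[c("VAR_1", "VAR_2", "VAR_3")]
```

Selecting by Condition rather than summing avoids having to replace NA with zero first.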