I have a dataframe in R that looks like the following:
a b c condition
1 4 2 acap
2 3 1 acap
2 4 3 acap
5 6 8 ncap
5 7 6 ncap
8 7 6 ncap
I am trying to recode the values in columns a, b, and c for condition ncap (and also 2 other conditions not pictured here) while leaving the values for acap alone.
The following code works when applied to the first 3 columns. I am trying to figure out how I can apply this only to rows that I specify by condition while keeping everything in the same dataframe.
df = df %>%
mutate_at(vars(a:c), function(x)
case_when(x == 5 ~ 1, x == 6 ~ 2, x == 7 ~ 3, x == 8 ~ 4))
This is the expected output.
a b c condition
1 4 2 acap
2 3 1 acap
2 4 3 acap
1 2 4 ncap
1 3 2 ncap
4 3 2 ncap
I've looked around for an answer to this question and am unable to find it. If someone knows of an answer that already exists, I would appreciate being directed to it.
We can use case_when on a condition created with row_number(), i.e. if the row number is 4 to 6, subtract 4 from the value, otherwise return the value unchanged:
df %>%
mutate_at(vars(a:c), funs(case_when(row_number() %in% 4:6 ~ . - 4L,
TRUE ~ .)))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
If this is based on the value instead of the rows, create the condition on the value
df %>%
mutate_at(vars(a:c), funs(case_when(. %in% 5:8 ~ . - 4L,
TRUE ~ .)))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
Or if it is based on the value in the 'condition' column:
df %>%
mutate_at(vars(a:c), funs(case_when(condition == 'ncap' ~ . - 4L,
TRUE ~ .)))
Or without using any case_when
df %>%
mutate_at(vars(a:c), funs( . - c(0, 4)[(condition == 'ncap')+1]))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
In base R, we can do this by creating a logical index:
i1 <- df$condition =='ncap'
df[i1, 1:3] <- df[i1, 1:3] - 4
data
df <- structure(list(a = c(1L, 2L, 2L, 5L, 5L, 8L), b = c(4L, 3L, 4L,
6L, 7L, 7L), c = c(2L, 1L, 3L, 8L, 6L, 6L), condition = c("acap",
"acap", "acap", "ncap", "ncap", "ncap")), class = "data.frame",
row.names = c(NA, -6L))
You can use filter to apply the recoding to only the rows you specify (those where condition is not equal to "acap" here):
library(dplyr)
df %>%
filter(condition != "acap") %>%
mutate_at(vars(a:c), function(x)
case_when(x == 5 ~ 1, x == 6 ~ 2, x == 7 ~ 3, x == 8 ~ 4))
# a b c condition
#1 1 2 4 ncap
#2 1 3 2 ncap
#3 4 3 2 ncap
If you need the entire dataframe back again, you can do:
df %>%
filter(condition == "acap") %>%
bind_rows(df %>%
filter(condition != "acap") %>%
mutate_at(vars(a:c), function(x)
case_when(x == 5 ~ 1, x == 6 ~ 2, x == 7 ~ 3, x == 8 ~ 4)))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
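Note that mutate_at() and funs() are superseded in current dplyr. A minimal sketch of the same row-restricted recode using across() and if_else(), assuming dplyr >= 1.0 and the subtraction version of the mapping:
library(dplyr)
df %>%
  mutate(across(a:c, ~ if_else(condition == "ncap", .x - 4L, .x)))
Rows where condition is "acap" keep their original values because if_else() returns .x unchanged there.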
I have a table in R like this:
x Y
1 2 1
2 1 1
3 NA 1
4 2 NA
5 NA NA
6 2 2
7 1 1
and what I'm hoping to do is make a new column called XY based on whether a 1 exists in either x or Y.
For example, if x is 1 and Y is 2 then XY should be 1; if x is NA and Y is 1 then XY should be 1. If both x and Y are 2 then XY should be 2.
The priority of the categorical values 1, 2 and NA is 1 > 2 > NA.
In short, my desired output looks like this:
x Y XY
1 2 1 1
2 1 1 1
3 NA 1 1
4 2 NA 2
5 NA NA NA
6 2 2 2
7 1 1 1
I'm new to R and trying to trim my data. Thank you for your help! I really appreciate it. :)
Try this
library(dplyr)
ans <- df |>
  rowwise() |>
  mutate(z1 = coalesce(c_across(x), 0), z2 = coalesce(c_across(Y), 0)) |>
  mutate(XY = case_when(any(c_across(z1:z2) == 1) ~ 1,
                        any(c_across(z1:z2) == 2) ~ 2)) |>
  select(-z1, -z2) |>
  ungroup()
output
# A tibble: 7 × 3
x Y XY
<int> <int> <dbl>
1 2 1 1
2 1 1 1
3 NA 1 1
4 2 NA 2
5 NA NA NA
6 2 2 2
7 1 1 1
data
df <- structure(list(x = c(2L, 1L, NA, 2L, NA, 2L, 1L), Y = c(1L, 1L,
1L, NA, NA, 2L, 1L)), row.names = c("1", "2", "3", "4", "5",
"6", "7"), class = "data.frame")
You could do it with a case_when (remembering that conditions are evaluated from the top down and the first match wins, so the higher-priority condition goes first):
library(dplyr)
df <-
df |>
mutate(XY = case_when(x == 1 | Y == 1 ~ 1,
x == 2 | Y == 2 ~ 2,
TRUE ~ NA_real_))
Or apply the same logic using base functionalities:
df$XY <- NA
df$XY[df$x == 2 | df$Y == 2] <- 2  # assign the lower-priority value first
df$XY[df$x == 1 | df$Y == 1] <- 1  # then let 1 overwrite it wherever it applies
Output:
x Y XY
<dbl> <dbl> <dbl>
1 2 1 1
2 1 1 1
3 NA 1 1
4 2 NA 2
5 NA NA NA
6 2 2 2
7 1 1 1
Data:
library(readr)
df <- read_table("
x Y
2 1
1 1
NA 1
2 NA
NA NA
2 2
1 1")
Here is a base R approach. For each row, check if any value is 1 (removing NA) and, if so, set XY to 1. Then check for any value of 2 in a similar fashion. If neither is found, set XY to NA. If you have more columns, you can subset the specific columns to be evaluated inside the function call (in this case, x and Y).
df$XY <- apply(df,
1,
function(x) {
if (any(x == 1, na.rm = T)) return(1)
if (any(x == 2, na.rm = T)) return(2)
return(NA)
})
Output
x Y XY
1 2 1 1
2 1 1 1
3 NA 1 1
4 2 NA 2
5 NA NA NA
6 2 2 2
7 1 1 1
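Since the codes here are numeric and the priority 1 > 2 > NA happens to coincide with the numeric minimum, a vectorized base R alternative is pmin(); this is only a sketch that holds as long as that coincidence does:
# row-wise minimum across x and Y, ignoring NAs; an all-NA row stays NA
df$XY <- pmin(df$x, df$Y, na.rm = TRUE)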
I have a data set (n=500) in R that looks like this
ID A C S
1 4 4 4
2 3 2 3
3 5 4 2
4 7 7 2
I'd like to create a new variable (I am calling this variable "Same") that tells me whether any of my columns have the same value (excluding my ID column). So,
ID A C S Same
1 4 4 4 all
2 3 2 3 as
3 5 4 2 none
4 7 7 2 ac
Any help would be much appreciated! I am pretty lost! Thank you!
We may loop over the rows with apply (MARGIN = 1) on the selected columns ([-1] drops the 'ID' column), then check the number of unique elements: if it is 1, return 'all'; otherwise paste together the names of the duplicated elements. If there are no duplicates this returns a blank "", so change the blank to 'none'.
df1$Same <- apply(df1[-1], 1, \(x) {
x1 <- if(length(unique(x)) == 1) 'all' else
paste(tolower(names(x))[duplicated(x)|duplicated(x,
fromLast = TRUE)], collapse = "")
x1[x1 == ""] <- "none"
x1})
-output
> df1
ID A C S Same
1 1 4 4 4 all
2 2 3 2 3 as
3 3 5 4 2 none
4 4 7 7 2 ac
data
df1 <- structure(list(ID = 1:4, A = c(4L, 3L, 5L, 7L), C = c(4L, 2L,
4L, 7L), S = c(4L, 3L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
Try this using dplyr rowwise with rle
df |>
  rowwise() |>
  mutate(Same = case_when(length(rle(sort(c_across(A:S)))$values) == 1 ~ "all",
                          length(rle(sort(c_across(A:S)))$values) == 3 ~ "none",
                          c_across(A) == c_across(C) ~ "ac",
                          c_across(C) == c_across(S) ~ "cs",
                          TRUE ~ "as"))
output
# A tibble: 4 × 5
# Rowwise:
ID A C S Same
<int> <int> <int> <int> <chr>
1 1 4 4 4 all
2 2 3 2 3 as
3 3 5 4 2 none
4 4 7 7 2 ac
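If the table really has just these three value columns, a fully vectorised base R sketch (with the column names hard-coded, unlike the apply() answer above) is also possible:
same_ac <- df1$A == df1$C
same_cs <- df1$C == df1$S
same_as <- df1$A == df1$S
# check "all" first, then each pair; anything left over has no duplicates
df1$Same <- ifelse(same_ac & same_cs, "all",
            ifelse(same_ac, "ac",
            ifelse(same_cs, "cs",
            ifelse(same_as, "as", "none"))))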
R - Count unique/distinct values in two columns together
Hi everyone. I have a panel of electoral behaviour, but I am having problems computing a new variable that would capture the unique values (parties) across my two columns Party and Party2013 per group. The column Party2013 records the vote in the 2013 election and Party records voters' intentions after 2013. Every time I try n_distinct or length I get the count of unique values in each column separately, not combined.
ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
Based on the example above I normally get a count of 3 instead of the desired 2.
I've tried the following commands but got only the number of separate unique values:
data %>% group_by(ID) %>% distinct(Party, Party2013, .keep_all = TRUE) %>% dplyr::summarise(Party_Party2013 = n())
or
ddply(data, .(ID), mutate, count = length(unique(Party, Party2013)))
The expected outcome would be as follows:
ID Wave Party Party2013 Count
1 1 A A 2
1 2 A NA 2
1 3 B NA 2
1 4 B NA 2
2 1 A C 3
2 2 B NA 3
2 3 B NA 3
2 4 B NA 3
I would very much appreciate any advice on how to count the overall number of unique parties across the two columns per group, and not the number of distinct values in each column separately. Thanks.
You can subset the data from cur_data() and unlist it to get a vector, then use n_distinct to count the number of unique values.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
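In more recent dplyr versions cur_data() is deprecated; a minimal sketch of the same approach using pick(), assuming dplyr >= 1.1:
df %>%
  group_by(ID) %>%
  mutate(Count = n_distinct(unlist(pick(Party, Party2013)), na.rm = TRUE)) %>%
  ungroup()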
In situations like this I always like to simplify the problem and change the data into the long format, since such problems are easier to solve when all of your values are in one column. With pivot_longer() you can also use the argument values_drop_na = TRUE to drop the NAs that were counted in your example:
library(tidyr)
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>% pivot_longer(cols = starts_with("Party"), values_drop_na = TRUE) %>% group_by(ID) %>%
summarise(Count = n_distinct(value)) %>% merge(data, .)
#> ID Wave Party Party2013 Count
#> 1 1 1 A A 2
#> 2 1 2 A <NA> 2
#> 3 1 3 B <NA> 2
#> 4 1 4 B <NA> 2
#> 5 2 1 A C 3
#> 6 2 2 B <NA> 3
#> 7 2 3 B <NA> 3
#> 8 2 4 B <NA> 3
Created on 2021-08-30 by the reprex package (v2.0.1)
You can also do it this way:
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>%
  group_by(ID) %>%
  mutate(Count = c(Party, Party2013) %>%
           na.omit() %>% unique() %>% length() %>%
           rep(length(Party)))
output
# A tibble: 8 x 5
# Groups: ID [2]
ID Wave Party Party2013 Count
<int> <int> <chr> <chr> <int>
1 1 1 A A 2
2 1 2 A NA 2
3 1 3 B NA 2
4 1 4 B NA 2
5 2 1 A C 3
6 2 2 B NA 3
7 2 3 B NA 3
8 2 4 B NA 3
I have an issue in R I cannot fix, so I'm asking for help here. I want to combine three columns (H, C, W) into a single column G by summing them, but haven't found a way to do so. Let's say the data looks like this:
Time H C W K
0 1 2 0 5
1 5 2 1 1
2 0 1 2 2
How do I turn it into this table:
Time G K
0 3 5
1 8 1
2 3 2
Maybe you can try the code below
subset(within(df, G <- rowSums(cbind(H, C, W))), select = -c(H, C, W))
giving
Time K G
1 0 5 3
2 1 1 8
3 2 2 3
or a data.table option
> setDT(df)[, .(Time, G = rowSums(cbind(H, C, W)), K)][]
Time G K
1: 0 3 5
2: 1 8 1
3: 2 3 2
We can use transmute
library(dplyr)
df %>%
transmute(Time, G = rowSums(select(., H:W)), K)
# Time G K
#1 0 3 5
#2 1 8 1
#3 2 3 2
Maybe try this:
#Code
newdf <- data.frame(df[,1,drop=F],G=rowSums(df[,-c(1,5)]),df[,5,drop=F])
Output:
Time G K
1 0 3 5
2 1 8 1
3 2 3 2
Some data used:
#Data
df <- structure(list(Time = 0:2, H = c(1L, 5L, 0L), C = c(2L, 2L, 1L
), W = 0:2, K = c(5L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
As a shortcut, instead of listing each variable, and building on the answer from @KarthikS, you can use c_across():
library(dplyr)
#Code2
newdf <- df %>% rowwise() %>% mutate(G = sum(c_across(H:W))) %>% select(Time, G, K)
Output:
# A tibble: 3 x 3
# Rowwise:
Time G K
<int> <int> <int>
1 0 3 5
2 1 8 1
3 2 3 2
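For completeness, with current dplyr (>= 1.0) the select()/rowwise() variants above can also be written with across() inside rowSums(); a minimal sketch:
library(dplyr)
df %>%
  mutate(G = rowSums(across(H:W)), .keep = "unused") %>%  # drops the columns that fed G
  relocate(G, .after = Time)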
I have a strange dataset format where a simple reshape function won't work. Assume I have three time periods (1-3), two id names (A and B), and three variables (X, Y, and Z) in the following format, where the id names and variable names are separated by -:
Time A-X A-Y A-Z B-X B-Y B-Z
1 2 4 5 6 1 2
2 2 3 2 3 2 3
3 4 4 4 4 4 4
Ideally, I would like to produce the dataset in the following format:
ID Time X Y Z
A 1 2 4 5
A 2 2 3 2
A 3 4 4 4
B 1 6 1 2
B 2 3 2 3
B 3 4 4 4
Which functions to use?
library(dplyr)
library(tidyr)
library(splitstackshape)
df %>%
gather(key, value, -Time) %>%
cSplit("key", sep="_") %>%
spread(key_2, value) %>%
rename(ID = key_1) %>%
arrange(ID, Time)
Output is:
Time ID X Y Z
1 1 A 2 4 5
2 2 A 2 3 2
3 3 A 4 4 4
4 1 B 6 1 2
5 2 B 3 2 3
6 3 B 4 4 4
Sample data:
df <- structure(list(Time = 1:3, A_X = c(2L, 2L, 4L), A_Y = c(4L, 3L,
4L), A_Z = c(5L, 2L, 4L), B_X = c(6L, 3L, 4L), B_Y = c(1L, 2L,
4L), B_Z = 2:4), .Names = c("Time", "A_X", "A_Y", "A_Z", "B_X",
"B_Y", "B_Z"), class = "data.frame", row.names = c(NA, -3L))
Here is another dplyr and tidyr solution.
df %>%
gather(ID, value, -Time) %>%
separate(ID, into = c("ID", "var")) %>%
spread(var, value) %>%
arrange(ID) %>%
select(ID, Time, X, Y, Z)
# ID Time X Y Z
# 1 A 1 2 4 5
# 2 A 2 2 3 2
# 3 A 3 4 4 4
# 4 B 1 6 1 2
# 5 B 2 3 2 3
# 6 B 3 4 4 4
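With current tidyr (>= 1.0), where pivot_longer() supersedes gather()/spread(), the whole reshape can be done in one call; a sketch assuming the underscore-separated sample data shown above:
library(dplyr)
library(tidyr)
df %>%
  # the part before "_" becomes ID, the part after "_" names the value column
  pivot_longer(-Time, names_to = c("ID", ".value"), names_sep = "_") %>%
  arrange(ID, Time) %>%
  relocate(ID)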