I am trying to unite two columns, but unite() concatenates the data from both columns. I would like to keep only the data from the second column, B, and drop the data from column A.
Is there a way to do this without manually deleting the data from the first column beforehand? Thank you!
library(tidyr)
df1 <- data.frame(A = c(1,1,1,1,1,1,1,1),
                  B = c(2,2,2,2,2,2,2,2))
df1 %>% unite("A",A,B,remove = TRUE,na.rm = TRUE)
The output is 1_2 in the new combined column 'A'.
Do you want to keep only column B where a value is present, otherwise use column A, like this? So you end up with one column (C) that coalesces A and B?
If you want to delete column A, then just add %>% select(-A).
library(tidyverse)
data.frame(
A = c(1, 1, 1, 1, 1, 1, 1, 1),
B = c(2, 2, 2, 2, 2, 2, NA, 2)
) %>%
mutate(C = coalesce(B, A))
#> A B C
#> 1 1 2 2
#> 2 1 2 2
#> 3 1 2 2
#> 4 1 2 2
#> 5 1 2 2
#> 6 1 2 2
#> 7 1 NA 1
#> 8 1 2 2
Created on 2022-05-17 by the reprex package (v2.0.1)
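For the case in the question, where column A should simply end up holding B's values, a minimal sketch follows. It assumes dplyr is loaded and that B may contain NA (as in the example above); the object names only_b and merged are illustrative, not from the original posts.

```r
library(dplyr)

df1 <- data.frame(
  A = c(1, 1, 1, 1, 1, 1, 1, 1),
  B = c(2, 2, 2, 2, 2, 2, NA, 2)
)

# Option 1: simply drop A, leaving B untouched
only_b <- df1 %>% select(-A)

# Option 2: overwrite A with B where B is present (falling back to A),
# then drop B so a single column named A remains
merged <- df1 %>%
  mutate(A = coalesce(B, A)) %>%
  select(-B)
```

Option 1 is enough when B never contains NA; option 2 keeps the column name A while preferring B's values.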
Related
I have a longitudinal data set in wide format, with > 2500 columns. Almost all columns begin with 'W1_' or 'W2_' to indicate the wave (ie, time point) of data collection. In the real data, there are > 2 waves. They look like this:
# Populate wide format data frame
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
wide
#> person W1_resp_sex W2_resp_sex W1_edu W2_q_2_1
#> 1 1 1 1 1 0
#> 2 2 2 2 2 1
#> 3 3 1 1 3 1
#> 4 4 2 2 4 0
I want to reshape from wide to long format so that the data look like this:
# Populate long data frame (this is how we want the wide data above to look after reshaping it)
person <- c(1, 1, 2, 2, 3, 3, 4, 4)
wave <- c(1, 2, 1, 2, 1, 2, 1, 2)
sex <- c(1, 1, 2, 2, 1, 1, 2, 2)
education <- c(1, NA, 2, NA, 3, NA, 4, NA)
q_2_1 <- c(NA, 0, NA, 1, NA, 1, NA, 0)
long_goal <- as.data.frame(cbind(person, wave, sex, education, q_2_1))
long_goal
#> person wave sex education q_2_1
#> 1 1 1 1 1 NA
#> 2 1 2 1 NA 0
#> 3 2 1 2 2 NA
#> 4 2 2 2 NA 1
#> 5 3 1 1 3 NA
#> 6 3 2 1 NA 1
#> 7 4 1 2 4 NA
#> 8 4 2 2 NA 0
To reshape the data, I tried pivot_longer(), but I ran into the issues listed below. How do I fix them?
(I prefer not to use data.table.)
The variables have different naming patterns. (How can I correctly specify names_pattern?)
There are multiple value columns (see how all values end up under the 'sex' column).
Creating a column with NA when a variable was only collected in one wave (i.e., if it was only collected in wave 2, I want a column with W1_varname in which all values are NA).
# Re-load wide format data
person <- c(1, 2, 3, 4)
W1_resp_sex <- c(1, 2, 1, 2)
W2_resp_sex <- c(1, 2, 1, 2)
W1_edu <- c(1, 2, 3, 4)
W2_q_2_1 <- c(0, 1, 1, 0)
wide <- as.data.frame(cbind(person, W1_resp_sex, W2_resp_sex, W1_edu, W2_q_2_1))
# Load package
pacman::p_load(tidyr)
# Reshape from wide to long
long <- wide %>%
pivot_longer(
cols = starts_with('W'),
names_to = 'Wave',
names_prefix = 'W',
names_pattern = '(.*)_',
values_to = 'sex',
values_drop_na = TRUE
)
long
#> # A tibble: 16 × 3
#> person Wave sex
#> <dbl> <chr> <dbl>
#> 1 1 1_resp 1
#> 2 1 2_resp 1
#> 3 1 1 1
#> 4 1 2_q_2 0
#> 5 2 1_resp 2
#> 6 2 2_resp 2
#> 7 2 1 2
#> 8 2 2_q_2 1
#> 9 3 1_resp 1
#> 10 3 2_resp 1
#> 11 3 1 3
#> 12 3 2_q_2 1
#> 13 4 1_resp 2
#> 14 4 2_resp 2
#> 15 4 1 4
#> 16 4 2_q_2 0
Created on 2022-09-19 by the reprex package (v2.0.1)
We can reshape to long format with pivot_longer, using names_pattern to capture substrings of the column names. Each capture group ((...)) is matched, in order, to an entry of names_to: the wave column gets the digits (\\d+) after the 'W', while .value (which turns the captured text into the names of the output value columns) corresponds to the substring after the first _ in the column name. Then we rename the resp_sex and edu columns.
library(dplyr)
library(tidyr)
pivot_longer(wide, cols = -person, names_to = c("wave", ".value"),
names_pattern = "^W(\\d+)_(.*)$") %>%
rename_with(~ c("sex", "education"), c("resp_sex", "edu"))
Output:
# A tibble: 8 × 5
person wave sex education q_2_1
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 1 1 NA
2 1 2 1 NA 0
3 2 1 2 2 NA
4 2 2 2 NA 1
5 3 1 1 3 NA
6 3 2 1 NA 1
7 4 1 2 4 NA
8 4 2 2 NA 0
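To see what the two capture groups in that names_pattern pick out, here is a small base R sketch. It only illustrates the regular expression; the variable names cols, pattern, wave and variable are made up for the example.

```r
cols <- c("W1_resp_sex", "W2_resp_sex", "W1_edu", "W2_q_2_1")
pattern <- "^W(\\d+)_(.*)$"

# First group: the digits right after the leading "W" (goes to `wave`)
wave <- sub(pattern, "\\1", cols)

# Second group: everything after the first underscore (goes to `.value`)
variable <- sub(pattern, "\\2", cols)

print(data.frame(cols, wave, variable))
```

Because (\\d+) must be followed directly by an underscore, the split always happens at the first _, so a name like W2_q_2_1 keeps its later underscores in the variable part.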
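You want to reshape the variables that are measured in both waves. You can find them by tabulating the column names with the wave prefix stripped.
# Names (without the 3-character wave prefix) that occur in both waves
common <- names(which(table(substring(names(wide)[-1], 4)) == 2))
# Column positions of those variables; %in% also works when there are several
v <- which(substring(names(wide), 4) %in% common)
reshape2::melt(data=wide, id.vars=1, measure.vars=v)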
# person variable value
# 1 1 W1_resp_sex 1
# 2 2 W1_resp_sex 2
# 3 3 W1_resp_sex 1
# 4 4 W1_resp_sex 2
# 5 1 W2_resp_sex 1
# 6 2 W2_resp_sex 2
# 7 3 W2_resp_sex 1
# 8 4 W2_resp_sex 2
I have a column in a data frame (here named "a") where the start of each sequence is marked with 1, while subsequent incidents belonging to the same sequence are marked with NA. I would like to create a new column ("b") that indexes the incidents within each sequence (1:n), and a third column ("c") with numbers indicating which incidents belong to the same sequence.
I am sure the solution is very easy and obvious once I see it, but at the moment I cannot come up with a good way to solve this myself. As far as I have seen, other questions do not cover this case.
I usually use dplyr (I also need to do some group_by with my data, which in reality is more complex than outlined here), so I would be very happy with a dplyr solution if possible!
Example code to start with:
df <- data.frame("a"= c(1, NA, NA, NA, 1, NA, 1, 1, 1))
How it should look in the end:
df_final <- data.frame("a"= c(1, NA, NA, NA, 1, NA, 1, 1, 1), "b"= c(1, 2, 3, 4, 1, 2, 1, 1, 1), "c" = c(1, 1, 1, 1, 2, 2, 3, 4, 5))
EDIT
Since the question has changed, getting the expected output is now simpler:
library(dplyr)
df %>%
group_by(c = cumsum(!is.na(a))) %>%
mutate(b = row_number())
# a c b
# <dbl> <int> <int>
#1 1 1 1
#2 NA 1 2
#3 NA 1 3
#4 NA 1 4
#5 1 2 1
#6 NA 2 2
#7 1 3 1
#8 1 4 1
#9 1 5 1
And using base R that would be :
df$c <- cumsum(!is.na(df$a))
df$b <- with(df, ave(a, c, FUN = seq_along))
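To make the grouping trick explicit, here is a step-by-step sketch on the plain vector. It uses base R only; the intermediate names starts, seq_id and within_id are illustrative.

```r
a <- c(1, NA, NA, NA, 1, NA, 1, 1, 1)

# TRUE wherever a new sequence starts
starts <- !is.na(a)

# The running count of starts gives one id per sequence (column "c")
seq_id <- cumsum(starts)

# Position of each row within its sequence (column "b")
within_id <- ave(seq_along(a), seq_id, FUN = seq_along)

print(data.frame(a, b = within_id, c = seq_id))
```

Every non-NA value bumps the running total by one, so all rows up to the next start share the same id.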
Original Answer
Unfortunately, the grouping needed to create b and c is different. For b, we group by the cumulative sum of the non-NA indicator and generate a row_number within each group. For c, we run rle on the non-NA indicator and repeat the cumulative sum of its values lengths times.
library(dplyr)
df %>%
group_by(group = cumsum(!is.na(a))) %>%
mutate(b = row_number()) %>%
ungroup() %>%
select(-group) %>%
mutate(c = with(rle(!is.na(a)), rep(cumsum(values), lengths)))
# A tibble: 9 x 3
# a b c
# <dbl> <int> <int>
#1 1 1 1
#2 NA 2 1
#3 NA 3 1
#4 NA 4 1
#5 1 1 2
#6 NA 2 2
#7 1 1 3
#8 1 1 3
#9 1 1 3
Of course, this is not a dplyr-specific problem and can be solved with base R as well:
df$b <- with(df, ave(a, cumsum(!is.na(a)), FUN = seq_along))
df$c <- with(df, with(rle(!is.na(a)), rep(cumsum(values), lengths)))
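The rle step is easier to follow when its pieces are inspected separately. A minimal sketch in base R, with the illustrative names r and c_old:

```r
a <- c(1, NA, NA, NA, 1, NA, 1, 1, 1)

# Runs of the non-NA indicator: TRUE, FALSE, TRUE, FALSE, TRUE
r <- rle(!is.na(a))
r$values
r$lengths

# cumsum(values) increases only when a TRUE run begins, so an NA run
# inherits the id of the TRUE run before it; rep() expands per row
c_old <- rep(cumsum(r$values), r$lengths)
c_old
```

Note that consecutive 1s form a single run here, which is why the last three rows share one id in this version of the answer.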
I have two vectors which contain indices which look like
index A index B
1 1
1 1
1 1
1 2
1 2
2 1
2 1
Now, I want to find how often each combination of index A and index B occurs. In my example there are three unique combinations of index A and index B, and I want to get back 3, 2, 2 in a vector. Does anyone know how to do this without a for loop?
EDIT:
So, in this example there are three unique combinations (1 1, 1 2 and 2 1): combination 1 1 occurs 3 times, 1 2 occurs 2 times and 2 1 occurs 2 times. Therefore, I want to return 3, 2, 2.
I think this is what you want:
library(plyr)
df <- data.frame(index_A = c(1, 1, 1, 1, 1, 2, 2),
index_B = c(1, 1, 1, 2, 2, 1, 1))
count(df, vars = c("index_A", "index_B"))
#> index_A index_B freq
#> 1 1 1 3
#> 2 1 2 2
#> 3 2 1 2
Created on 2019-03-17 by the reprex package (v0.2.1)
I got this from here.
In base R, we can use table
as.data.frame(table(dat))
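One thing to be aware of with table is that it cross-tabulates every level of both columns, so combinations that never occur show up with a count of zero. A small sketch, assuming the data frame is called dat with columns indexA and indexB as in the other answers; tab and observed are illustrative names.

```r
dat <- data.frame(
  indexA = c(1, 1, 1, 1, 1, 2, 2),
  indexB = c(1, 1, 1, 2, 2, 1, 1)
)

# Four rows: the unobserved combination (2, 2) gets Freq 0
tab <- as.data.frame(table(dat))

# Keep only combinations that actually occur
observed <- tab[tab$Freq > 0, ]
observed$Freq
```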
You could paste the vectors together and call rle
rle(do.call(paste0, dat))$lengths
# [1] 3 2 2
If you need the result as a data.frame, do
as.data.frame(unclass(rle(do.call(paste0, dat))))
# lengths values
#1 3 11
#2 2 12
#3 2 21
data
text <- "indexA indexB
1 1
1 1
1 1
1 2
1 2
2 1
2 1"
dat <- read.table(text = text, header = TRUE)
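The rle approach counts runs of adjacent equal keys, so it only gives per-combination counts when identical combinations sit next to each other. A hedged sketch of what happens otherwise, using a deliberately shuffled copy of the data (shuffled and key are illustrative names, and a "_" separator is used here to avoid ambiguous keys such as 1 and 12 versus 11 and 2):

```r
dat <- data.frame(
  indexA = c(1, 1, 1, 1, 1, 2, 2),
  indexB = c(1, 1, 1, 2, 2, 1, 1)
)

# Reorder the rows so equal combinations are no longer adjacent
shuffled <- dat[c(1, 4, 2, 6, 3, 5, 7), ]
key <- do.call(paste, c(shuffled, sep = "_"))

# Every run now has length 1, which is not the count we want
rle(key)$lengths

# Sorting the keys first restores the per-combination counts
rle(sort(key))$lengths
```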
This is somewhat hacky:
library(dplyr)
df %>%
mutate(Combined=paste0(`index A`,"_",`index B`)) %>%
group_by(Combined) %>%
summarise(n=n())
# A tibble: 3 x 2
Combined n
<chr> <int>
1 1_1 3
2 1_2 2
3 2_1 2
Can actually just do:
df %>%
group_by(`index A`,`index B`) %>%
summarise(n=n())
Adding tidyr unite as suggested by @kath:
library(tidyr)
df %>%
unite(new_col,`index A`,`index B`,sep="_") %>%
add_count(new_col) %>%
unique()
Data:
df<-read.table(text="index A index B
1 1
1 1
1 1
1 2
1 2
2 1
2 1",header=T,as.is=T,fill=T)
df<-df[,1:2]
names(df)<-c("index A","index B")
Using dplyr :
library(dplyr)
count(dat,!!!dat)$n
# [1] 3 2 2
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
#> a b
#> 1 1 4
#> 2 2 5
#> 3 3 6
I figured out how to overwrite data with base R. Maybe that was day 1 of learning R.
df[2:3, 2] <- c(50, 60)
#> a b
#> 1 1 4
#> 2 2 50
#> 3 3 60
I never found an easy way to do it with dplyr. How do I overwrite data with the pipe %>%?
We can use replace() within mutate(). Referring to the column by name ('b'), we replace its values by passing the row indices as the list argument of replace() and the new values as a vector:
library(dplyr)
df %>%
mutate(b = replace(b, 2:3, c(50, 60)))
# a b
#1 1 4
#2 2 50
#3 3 60
Or specify the column by its index in mutate_at:
df %>%
mutate_at(2, replace, list = 2:3, values = c(50, 60))
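In recent dplyr versions the mutate_at family is superseded by across(), so the same idea could be written as below. This is a sketch rather than part of the original answer; it assumes dplyr 1.0.0 or later, and out is an illustrative name.

```r
library(dplyr)

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# Select the second column by position and overwrite rows 2 and 3
out <- df %>%
  mutate(across(2, ~ replace(.x, 2:3, c(50, 60))))

out
```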
Newbie question
I have 2 columns in a data frame that looks like
Name Size
A 1
A 1
A 1
A 2
A 2
B 3
B 5
C 7
C 17
C 17
I need a third column that counts up as a sequence and resets whenever either Name or Size changes value:
Name Size NewCol
A 1 1
A 1 2
A 1 3
A 2 1
A 2 2
B 3 1
B 5 1
C 7 1
C 17 1
C 17 2
Basically, it is a dummy field that lets me reference each record separately even when Name and Size are the same.
So the index goes from k to k+1 when the next row has the same values for both Name and Size; otherwise it resets to 1.
For example, if my data set has 200 rows with Name A and Size 1, they will be indexed 1 to 200. When it moves on to A and 2, the index resets.
We can try with data.table
library(data.table)
setDT(df1)[, NewCol := match(Size, unique(Size)), by = .(Name)]
df1
# Name Size NewCol
#1: A 1 1
#2: A 1 1
#3: A 2 2
#4: B 3 1
#5: C 7 1
#6: C 17 2
If there is a typo somewhere in the expected output, maybe this is what is wanted instead:
setDT(df1)[, NewCol := seq_len(.N), .(Name, Size)]
Or using dplyr
library(dplyr)
df1 %>%
group_by(Name) %>%
mutate(NewCol = match(Size, unique(Size)))
Or, matching the seq_len(.N) version above (grouped by both Name and Size):
df1 %>%
group_by(Name, Size) %>%
mutate(NewCol = row_number())
Or we can use the same approach with ave from base R
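The ave version is not spelled out above; one way it might look is sketched here. The data frame is rebuilt from the question's table, and grouping by both Name and Size is an assumption based on the expected output.

```r
df1 <- data.frame(
  Name = c("A", "A", "A", "A", "A", "B", "B", "C", "C", "C"),
  Size = c(1, 1, 1, 2, 2, 3, 5, 7, 17, 17)
)

# Number the rows within each Name/Size combination
df1$NewCol <- ave(seq_len(nrow(df1)), df1$Name, df1$Size, FUN = seq_along)

df1
```

Passing seq_len(nrow(df1)) as the first argument just gives ave something of the right length to split; the values themselves are replaced by seq_along within each group.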
I guess this might not be the most efficient solution, but it is at least a good start:
# Reproducing the example
df <- data.frame(Name=LETTERS[c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3)], Size=c(1, 1, 1, 2, 2, 3, 5, 7, 17, 17))
# Create new column with a combined Name/Size key
df$NewCol <- paste0(df$Name, df$Size)
# Modify column to write count instead
df$NewCol <- unlist(sapply(unique(df$NewCol), function(id) 1:table(df$NewCol)[id]))
df
Name Size NewCol
1 A 1 1
2 A 1 2
3 A 1 3
4 A 2 1
5 A 2 2
6 B 3 1
7 B 5 1
8 C 7 1
9 C 17 1
10 C 17 2