I want to add a grouping column to my data. My data is a text column with NA values separating the groups. Below is an example; group is the result I would like to achieve. I don't know how many rows each group will consist of, but there is always an NA separating groups (except after the last group). How can I create the group column?
library(tidyverse)
data <- tibble(raw = c("This", "Is", "First", NA, "This", "Is", "Second", NA, "And", "Third"),
group = c(1,1,1,1,2,2,2,2,3,3))
Take the cumulative sum of the NAs and add one if the current value is not NA.
data %>% mutate(group = cumsum(is.na(raw)) + !is.na(raw))
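Run on the example data, this reproduces the desired column:
data %>%
  mutate(group = cumsum(is.na(raw)) + !is.na(raw)) %>%
  pull(group)
# [1] 1 1 1 1 2 2 2 2 3 3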
One option is to create a logical vector based on the NA values and use cumsum:
library(dplyr)
data %>%
mutate(groupNew = cumsum(lag(is.na(raw), default = TRUE)))
# A tibble: 10 x 3
# raw group groupNew
# <chr> <dbl> <int>
# 1 This 1 1
# 2 Is 1 1
# 3 First 1 1
# 4 <NA> 1 1
# 5 This 2 2
# 6 Is 2 2
# 7 Second 2 2
# 8 <NA> 2 2
# 9 And 3 3
#10 Third 3 3
I have a dataset with a number of cases. Every case has two observations. The first observation for case number 1 has value 3 and the second observation has value 7. The two observations for case number 2 have missing values. I need to write code that fills the empty cells with the values from case number 1, so that the first row for case 2 has the same value as case 1 for obs = 1 and the second row has the same value for obs = 2. Of course, this is a very short version of a much bigger dataset, so I need something flexible enough to accommodate a couple of hundred cases, where the values to use as fillers change for every subject.
Here is a toy data set:
# toy dataset
df <- data.frame(
case = c(1, 1, 2, 2),
obs = c(1, 2, NA, NA),
value = c(3, 7, NA, NA)
)
# case obs value
# 1 1 1 3
# 2 1 2 7
# 3 2 NA NA
# 4 2 NA NA
Desired output:
case obs value
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7
We may use fill after grouping on the row sequence (rowid) of case:
library(dplyr)
library(data.table)
library(tidyr)
df %>%
group_by(grp = rowid(case)) %>%
fill(obs, value) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 4 × 3
case obs value
<dbl> <dbl> <dbl>
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7
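The same idea works without the data.table dependency. Here is a sketch that builds the within-case row index with base ave() instead of rowid(), under the same assumption that the rows of each case appear in the same order:
library(dplyr)
library(tidyr)
df %>%
  group_by(grp = ave(seq_along(case), case, FUN = seq_along)) %>%
  fill(obs, value) %>%
  ungroup() %>%
  select(-grp)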
I'm trying to manipulate my data so that the values in each column become the column headings, and the values in each of these new columns are the previous column names.
My table can be produced with the following code:
library(tidyverse)
df <- tibble("#" = c(1,2,3),
"1" = c("a","a","b"),
"2" = c("b","b","c"),
"3" = c("c","c","a"))
My current table looks like this:
# 1 2 3
1 a b c
2 a b c
3 b c a
And I want it to look like this:
# a b c
1 1 2 3
2 1 2 3
3 3 1 2
You could use order applied to each row of the tibble:
## new column names
names(df) <- c("#", sort(unlist(df[1, -1], use.names = FALSE)))
## apply order to each row of df
df[, -1] <- t(apply(df[, -1], 1, order))
df
#> # A tibble: 3 x 4
#> `#` a b c
#> <dbl> <int> <int> <int>
#> 1 1 1 2 3
#> 2 2 1 2 3
#> 3 3 3 1 2
Disclaimer: this assumes that each row contains a permutation of all available variables. If this is not the case (e.g. the letter a appears twice in a single row), there would be an issue assigning individual values to each column.
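For comparison, here is a sketch of the same reshape with tidyr's pivoting verbs (it rests on the same permutation assumption): pivot to long form, convert the old column names to integer positions, then pivot back using the letters as the new column names.
library(tidyverse)
df %>%
  pivot_longer(-`#`, names_to = "position", values_to = "letter") %>%
  mutate(position = as.integer(position)) %>%
  pivot_wider(names_from = letter, values_from = position)
#> # A tibble: 3 x 4
#>     `#`     a     b     c
#>   <dbl> <int> <int> <int>
#> 1     1     1     2     3
#> 2     2     1     2     3
#> 3     3     3     1     2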
I have a dataframe called "data". One of the columns is called "reward" and another is called "X.targetResp". I want to create a new dataframe, called "reward", that consists of all values from the column "reward" in "data". However, I want to exclude values of the "reward" column that are in the same row as an NA value in the "X.targetResp" column of "data".
I've tried the following:
reward <- data$reward %in% filter(!is.na(data$X.targetResp))
reward <- subset(data, reward, !(X.targetResp=="NA"))
reward <- subset(data, reward, !is.na(X.targetResp))
...but I get errors for each of them.
Thanks for your input!
In dplyr, you can use filter and !is.na() to filter out the ones with NA in X.targetResp, and then use the select function to select the reward column.
library(dplyr)
# Create example data frame
dat <- tibble(reward = 1:5,
              X.targetResp = c(2, 4, NA, NA, 10))
# Print the data frame
dat
# # A tibble: 5 x 2
# reward X.targetResp
# <int> <dbl>
# 1 1 2
# 2 2 4
# 3 3 NA
# 4 4 NA
# 5 5 10
# Use the filter function
reward <- dat %>%
filter(!is.na(X.targetResp)) %>%
select(reward)
reward
# # A tibble: 3 x 1
# reward
# <int>
# 1 1
# 2 2
# 3 5
And here is a base R solution with similar logic:
subset(dat, !is.na(X.targetResp), "reward")
# # A tibble: 3 x 1
# reward
# <int>
# 1 1
# 2 2
# 3 5
You can also consider using drop_na() on X.targetResp from tidyr.
library(dplyr)
library(tidyr)
reward <- dat %>%
drop_na(X.targetResp) %>%
select(reward)
reward
# # A tibble: 3 x 1
# reward
# <int>
# 1 1
# 2 2
# 3 5
Here is an example using the data.table package.
library(data.table)
setDT(dat)
reward <- dat[!is.na(X.targetResp), .(reward)]
reward
# reward
# 1: 1
# 2: 2
# 3: 5
You can simply use na.omit, which is designed to address this problem:
# replicating the same example data frame given by #www
data <- data.frame(
reward = 1:5,
X.targetResp = c(2, 4, NA, NA, 10)
)
# omitting the rows containing NAs
reward <- na.omit(data)
# resulting data frame with both columns
reward
# reward X.targetResp
# 1 1 2
# 2 2 4
# 5 5 10
# you can easily extract the first column if necessary
reward[1]
# reward
# 1 1
# 2 2
# 5 5
Following up @www's comment:
In case there are other columns you want to dodge:
# omitting the rows where only X.targetResp is NA
reward <- data[complete.cases(data["X.targetResp"]), ]
# resulting data frame with both columns
reward
# reward X.targetResp
# 1 1 2
# 2 2 4
# 5 5 10
# you can easily extract the first column if necessary
reward[1]
# reward
# 1 1
# 2 2
# 5 5
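If a plain vector of reward values (rather than a one-column data frame) is all you need, a minimal base R sketch using the original column names would be:
reward <- data$reward[!is.na(data$X.targetResp)]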
Assume I have a data frame like so:
set.seed(4)
df<-data.frame(
group = rep(1:10, each=3),
id = rep(sample(1:3), 10),
x = sample(c(rep(0, 15), runif(15))),
y = sample(c(rep(0, 15), runif(15))),
z = sample(c(rep(0, 15), runif(15)))
)
As seen above, some elements of the x, y, z vectors take the value zero; the rest are drawn from the uniform distribution between 0 and 1.
For each group, determined by the first column, I want to find three IDs from the second column, pointing to the highest value of the x, y, z variables in the group. Assume there are no draws, except for the cases in which a variable takes the value 0 in all observations of a given group; in that case I don't want to return any number as the id of the row with the maximum value.
The output would look like so:
group x y z
1 2 2 1
2 2 3 1
... .........
My first thought is to select rows with maximum values separately for each variable and then use merge to put it in one table. However, I'm wondering if it can be done without merge, for example with standard dplyr functions.
Here is my proposed solution using plyr:
library(plyr)
ddply(df, .variables = c("group"),
      .fun = function(t) {
        apply(X = t[, c(-1, -2)], MARGIN = 2,
              function(z) ifelse(sum(abs(z)) == 0, yes = NA, no = t$id[which.max(z)]))
      })
# group x y z
#1 1 2 2 1
#2 2 2 3 1
#3 3 1 3 2
#4 4 3 3 1
#5 5 2 3 NA
#6 6 3 1 3
#7 7 1 1 2
#8 8 NA 2 3
#9 9 2 1 3
#10 10 2 NA 2
A solution using dplyr and tidyr. Notice that if all numbers in a group-Column are the same, we cannot decide which id should be selected, so filter(n_distinct(Value) > 1) is added to remove those records. In the final output df2, NA indicates the condition where all numbers are the same; we can decide whether to impute those NAs later if we want. This solution should work for any number of ids or columns (x, y, z, ...).
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, -id) %>%
arrange(group, Column, desc(Value)) %>%
group_by(group, Column) %>%
# If all values from a group-Column are all the same, remove that group-Column
filter(n_distinct(Value) > 1) %>%
slice(1) %>%
select(-Value) %>%
spread(Column, id)
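Since gather and spread have been superseded, here is a sketch of the same logic with pivot_longer/pivot_wider (it assumes dplyr >= 1.0 for slice_max):
library(dplyr)
library(tidyr)
df2 <- df %>%
  pivot_longer(c(x, y, z), names_to = "Column", values_to = "Value") %>%
  group_by(group, Column) %>%
  # If all values in a group-Column are the same (all zero), drop it
  filter(n_distinct(Value) > 1) %>%
  # Keep the row with the largest Value in each group-Column
  slice_max(Value, n = 1, with_ties = FALSE) %>%
  select(-Value) %>%
  pivot_wider(names_from = Column, values_from = id)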
If you want to stick with just dplyr, you can use the multiple-column summarize/mutate functions. This should work regardless of the form of id; my initial attempt was slightly cleaner but assumed that an id of zero was invalid.
df %>%
group_by(group) %>%
mutate_at(vars(-id),
# If the row is the max within the group, set the value
# to the id and use NA otherwise
funs(ifelse(max(.) != 0 & . == max(.),
id,
NA))) %>%
select(-id) %>%
summarize_all(funs(
# There are zero or one non-NA values per group, so handle both cases
if(any(!is.na(.)))
na.omit(.) else NA))
## # A tibble: 10 x 4
## group x y z
## <int> <int> <int> <int>
## 1 1 2 2 1
## 2 2 2 3 1
## 3 3 1 3 2
## 4 4 3 3 1
## 5 5 2 3 NA
## 6 6 3 1 3
## 7 7 1 1 2
## 8 8 NA 2 3
## 9 9 2 1 3
## 10 10 2 NA 2
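funs() has since been deprecated; a sketch of the same result with the newer across() interface (assuming, as in the example data, that the maximum within a group is unique unless every value is zero) could look like this:
library(dplyr)
df %>%
  group_by(group) %>%
  summarize(across(c(x, y, z),
                   # NA when the whole column is zero in this group,
                   # otherwise the id of the row holding the maximum
                   ~ if (max(.x) == 0) NA_integer_ else id[which.max(.x)]))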
I have two data frames. Data frame A has many observations/rows, an ID for each observation, and many additional columns. For a subset of observations X, the values for a set of columns are missing/NA. Data frame B contains a subset of the observations in X (which can be matched across data frames using the ID) and variables with identical names as in data frame A, but containing values to replace the missing values in the set of columns with missing/NA.
My code below (using a join operation) merely adds columns rather than replacing missing values. For each of the additional variables (let's name them W) in B, the resulting table produces W.x and W.y.
library(dplyr)
foo <- data.frame(id = seq(1:6), x = c(NA, NA, NA, 1, 3, 8), z = seq_along(10:15))
bar <- data.frame(id = seq(1:2), x = c(10, 9))
dplyr::left_join(x = foo, y = bar, by = "id")
I am trying to replace the missing values in A using the values in B based on the ID, but do so in an efficient manner since I have many columns and many rows. My goal is this:
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
One thought was to use ifelse() after joining, but typing out ifelse() functions for all of the variables is not feasible. Is there a way to do this simply without the database join or is there a way to apply a function across all columns ending in .x to replace the values in .x with the value in .y if the value in .x is missing?
Another attempt, which should essentially be a single assignment operation. Using @alistaire's data again:
vars <- c("x","y")
foo[vars] <- Map(pmax, foo[vars], bar[match(foo$id, bar$id), vars], na.rm=TRUE)
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
EDIT
Updating the answer to use @alistaire's example data frame.
We can extend the original answer given below using mapply so that it handles multiple columns in both foo and bar.
First, find the common columns between the two data frames (excluding id) and sort them so they are in the same order:
vars <- sort(intersect(names(foo), names(bar))[-1])
foo[vars] <- mapply(function(x, y) {
ind = is.na(x)
replace(x, ind, y[match(foo$id[ind], bar$id)])
}, foo[vars], bar[vars])
foo
# id x y z
#1 1 10 1 1
#2 2 9 2 2
#3 3 NA 3 3
#4 4 1 4 4
#5 5 3 5 5
#6 6 8 6 6
Original Answer
I think this does what you are looking for:
foo[-1] <- sapply(foo[-1], function(x) {
ind = is.na(x)
replace(x, ind, bar$x[match(foo$id[ind], bar$id)])
})
foo
# id x z
#1 1 10 1
#2 2 9 2
#3 3 NA 3
#4 4 1 4
#5 5 3 5
#6 6 8 6
For every column (except id) we find the missing value in foo and replace it with corresponding values from bar.
If you don't mind verbose base R approaches, you can easily accomplish this using merge() and careful subsetting of your data frame.
df <- merge(foo, bar, by="id", all.x=TRUE)
names(df) <- c("id", "x", "z", "y")
df$x[is.na(df$x)] <- df$y[is.na(df$x)]
df <- df[c("id", "x", "z")]
> df
id x z
1 1 10 1
2 2 9 2
3 3 NA 3
4 4 1 4
5 5 3 5
6 6 8 6
You can iterate dplyr::coalesce over the intersection of the non-grouping columns. It's not elegant, but it should scale reasonably well:
library(tidyverse)
foo <- data.frame(id = seq(1:6),
x = c(NA, NA, NA, 1, 3, 8),
y = 1:6, # add extra shared variable
z = seq_along(10:15))
bar <- data.frame(id = seq(1:2),
y = c(1L, NA),
x = c(10, 9))
# names of non-grouping variables in both
vars <- intersect(names(foo), names(bar))[-1]
foobar <- left_join(foo, bar, by = 'id')
foobar <- vars %>%
map(paste0, c('.x', '.y')) %>% # make list of columns to coalesce
map(~foobar[.x]) %>% # for each set, subset foobar to a two-column data.frame
invoke_map(.f = coalesce) %>% # ...and coalesce it into a vector
set_names(vars) %>% # add names to list elements
bind_cols(foobar) %>% # bind into data.frame and cbind to foobar
select(union(names(foo), names(bar))) # drop duplicated columns
foobar
#> # A tibble: 6 x 4
#> id x y z
#> <int> <dbl> <int> <int>
#> 1 1 10 1 1
#> 2 2 9 2 2
#> 3 3 NA 3 3
#> 4 4 1 4 4
#> 5 5 3 5 5
#> 6 6 8 6 6
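If dplyr >= 1.0.0 is available, rows_patch() expresses this directly: it replaces only the NA values in foo with the matching values from bar, joined by id. A sketch with the same foo and bar as above:
library(dplyr)
foo %>%
  rows_patch(bar, by = "id")
#>   id  x y z
#> 1  1 10 1 1
#> 2  2  9 2 2
#> 3  3 NA 3 3
#> 4  4  1 4 4
#> 5  5  3 5 5
#> 6  6  8 6 6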