Subtracting the smaller column from the greater column in a dataframe in R

I have the input below and I would like to subtract the two columns, always subtracting the lower value from the higher one, because I don't want negative results. Sometimes the higher value is in the first column (PaternalOrigin) and other times in the second column (MaternalOrigin).
Input:
  PaternalOrigin MaternalOrigin
              16             20
               3              6
              11              0
               1              3
               1              4
               3             11
and the dput output is this:
df <- structure(list(PaternalOrigin = c(16, 3, 11, 1, 1, 3), MaternalOrigin = c(20, 6, 0, 3, 4, 11)), .Names = c("PaternalOrigin", "MaternalOrigin"), row.names = c(NA, -6L), class = "data.frame")
Thus, my expected output would look like:
  PaternalOrigin MaternalOrigin Results
              16             20       4
               3              6       3
              11              0      11
               1              3       2
               1              4       3
               3             11       8
Please, can someone advise me?
Thanks.

We can wrap the difference with abs:
transform(df, Results = abs(PaternalOrigin - MaternalOrigin))
#  PaternalOrigin MaternalOrigin Results
#1             16             20       4
#2              3              6       3
#3             11              0      11
#4              1              3       2
#5              1              4       3
#6              3             11       8
Or we can assign it to 'Results'
df$Results <- with(df, abs(PaternalOrigin - MaternalOrigin))
Or using data.table
library(data.table)
setDT(df)[, Results := abs(PaternalOrigin - MaternalOrigin)]
Or with dplyr
library(dplyr)
df %>%
mutate(Results = abs(PaternalOrigin - MaternalOrigin))
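If you prefer to state the "highest minus lowest" logic directly, pmax and pmin give the same result as wrapping with abs; a minimal base R sketch:
# element-wise maximum minus element-wise minimum of the two columns
df$Results <- with(df, pmax(PaternalOrigin, MaternalOrigin) -
                       pmin(PaternalOrigin, MaternalOrigin))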

Related

How to replace NA with zero in a column if the columns beside it have values, using R

I want to know a way to replace the NA in a column when the columns beside it have values. For example: if a worker has values in the other columns, it means he went to work that day, so an NA there should be replaced with zero; if there are no values in the surrounding columns either, he didn't go to work that day and the NA is correct.
I have been doing this by sorting the other columns, but it's very time consuming.
A sample of my data, called df (the real one has 30 columns and about 30,000 rows):
df <- data.frame(
  hours = c(NA, 3, NA, 8),
  interactions = c(NA, 3, 9, 9),
  sales = c(1, 1, 1, NA)
)
df$hours2 <- ifelse(
  test = is.na(df$hours) & rowSums(!is.na(df[, c("interactions", "sales")])) > 0,
  yes = 0,
  no = df$hours)
df
  hours interactions sales hours2
1    NA           NA     1      0
2     3            3     1      3
3    NA            9     1      0
4     8            9    NA      8
You could also do as follows:
library(dplyr)
mutate(df, X = if_else(is.na(hours) & (!is.na(interactions) | !is.na(sales)), 0, hours))
#   hours interactions sales X
# 1    NA           NA     1 0
# 2     3            3     1 3
# 3    NA            9     1 0
# 4     8            9    NA 8
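For the real data with 30 columns, hard-coding the column names won't scale. A base R sketch of the same row-wise idea, assuming every column of df is one of the worker columns:
# TRUE for rows containing at least one non-NA value; when a given column is NA,
# any non-NA value in that row necessarily comes from the columns beside it
has_data <- rowSums(!is.na(df)) > 0
df[] <- lapply(df, function(col) ifelse(is.na(col) & has_data, 0, col))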

Replicate rows with missing values and replace missing values by vector

I have a dataframe in which a column has some missing values.
I would like to replicate the rows with the missing values N times, where N is the length of a vector which contains replacements for the missing values.
I first define a replacement vector, then my starting data.frame, then my desired result and finally my attempt to solve it. Unfortunately that didn't work...
> replace_values <- c('A', 'B', 'C')
> data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1, 2))
  value result
1     3      5
2     4      3
3    NA      1
4    NA      2
> data.frame(value = c(3, 4, replace_values, replace_values), result = c(5, 3, rep(1, 3), rep(2, 3)))
  value result
1     3      5
2     4      3
3     A      1
4     B      1
5     C      1
6     A      2
7     B      2
8     C      2
> t <- data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1, 2))
> mutate(t, value = ifelse(is.na(value), replace_values, value))
  value result
1     3      5
2     4      3
3     C      1
4     A      2
You can try a tidyverse solution:
library(dplyr)
library(tidyr)
t %>%
  mutate(value = ifelse(is.na(value), paste0(replace_values, collapse = ","), value)) %>%
  separate_rows(value, sep = ",") %>%
  select(value, everything())
  value result
1     3      5
2     4      3
3     A      1
4     B      1
5     C      1
6     A      2
7     B      2
8     C      2
The idea is to replace the NAs with the ','-collapsed 'replace_values', then separate the collapsed values and bind them by row using tidyr's separate_rows function, and finally sort the data.frame according to your expected output.
We can do an rbind here using base R. Create a logical vector where 'value' is NA ('i1') and get the number of NA elements by taking its sum ('n'). Then build a data.frame by replicating 'replace_values' n times, repeating each 'result' element that corresponds to an NA in 'value' by the length of 'replace_values', and 'rbind' it with the subset of the dataset, i.e. the rows where 'value' is not NA.
i1 <- is.na(t$value)
n <- sum(i1)
rbind(t[!i1, ],
      data.frame(value = rep(replace_values, n),
                 result = rep(t$result[i1], each = length(replace_values))))
#  value result
#1     3      5
#2     4      3
#3     A      1
#4     B      1
#5     C      1
#6     A      2
#7     B      2
#8     C      2
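If you would rather avoid collapsing and re-splitting strings, tidyr's uncount can replicate the NA rows directly; a sketch, assuming a tidyr version that provides uncount and the question's data in t:
library(dplyr)
library(tidyr)
t %>%
  mutate(n = ifelse(is.na(value), length(replace_values), 1)) %>%
  uncount(n, .id = "rep") %>%   # replicate each NA row, tracking the copy index
  mutate(value = ifelse(is.na(value), replace_values[rep], value)) %>%
  select(-rep)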

Merging rows based on multiple variables [closed]

Working with a dataset that looks like this:
UserID PartnerID Happiness Result
     1         2        30      1
     2         1        20      1
As you can see this is repetitive. I'd like to take those two rows above and merge them into a single row. I have searched around but haven't found a solution that would work here. My ideal output would be this:
UserID PartnerID Happiness1 Happiness2 Result
     1         2         30         20      1
If you have no aversion to using packages, I would recommend you use tidyverse for this. The following piece of code should get your desired output:
#install.packages("devtools")
#devtools::install_github("hadley/tidyverse")
library(tidyverse)
# Create a data.frame
dff <- structure(list(UserID = c(1, 2, 3, 4, 5, 6),
                      PartnerID = c(2, 1, 4, 3, 6, 5),
                      Happiness = c(30, 20, 40, 50, 30, 20),
                      Result = c(1, 1, 1, 1, 1, 1)),
                 .Names = c("UserID", "PartnerID", "Happiness", "Result"),
                 row.names = c(NA, 6L),
                 class = "data.frame")
# UserID PartnerID Happiness Result
#      1         2        30      1
#      2         1        20      1
#      3         4        40      1
#      4         3        50      1
#      5         6        30      1
#      6         5        20      1
# Reshape the data.frame
dff %>%
  mutate(grouper = paste(UserID, PartnerID, sep = "")) %>%
  mutate(grouper = unlist(map(strsplit(grouper, ""),
                              function(x) paste0(sort(x), collapse = "")))) %>%
  group_by(grouper) %>%
  mutate(Happiness = toString(Happiness)) %>%
  ungroup() %>%
  dplyr::filter(!duplicated(grouper)) %>%
  separate(Happiness, into = c("Happiness1", "Happiness2")) %>%
  select(-grouper)
This solution uses chained operations with the help of the %>% operator.
The idea here is to create a grouping column (called grouper) by first concatenating the UserID and PartnerID columns and then sorting the characters in each row. At this point, the grouper column contains the ID of the user and the ID of their partner in sorted order, which means the user and their partner share the same grouper value. You can therefore use the group_by function from tidyverse to group your data by the grouper column. Once the data is grouped, you can mutate the Happiness column to a string (that's what the toString function is doing). At this point you can ungroup and filter out the duplicates. Once the duplicates are taken out, you can separate the Happiness column into two different columns, Happiness1 and Happiness2. Ultimately, you can drop the grouper column by using select(-grouper).
That should yield:
# UserID PartnerID Happiness1 Happiness2 Result
#      1         2         30         20      1
#      3         4         40         50      1
#      5         6         30         20      1
I hope this helps.
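Note that the character-sort grouper above can collide once the IDs have more than one digit: the pairs (12, 3) and (1, 23), for example, both sort to "123". A sketch of the same grouping idea that builds the pair key with pmin and pmax instead, assuming numeric IDs:
library(dplyr)
dff %>%
  group_by(pair = paste(pmin(UserID, PartnerID), pmax(UserID, PartnerID))) %>%
  summarise(UserID = first(UserID), PartnerID = first(PartnerID),
            Happiness1 = first(Happiness), Happiness2 = last(Happiness),
            Result = first(Result)) %>%
  select(-pair)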
Maybe something like this. Suppose your data is (I just added more toy data for the sake of clarity):
> df
#  UserID PartnerID Happiness Result
#       1         4        30      1
#       2         3        20      0
#       3         2        10      0
#       4         1        15      1
#      10        13        20      1
#      13        10        25      1
#       5         6        10      0
#      11        12        10      1
#       6         5        10      0
#      12        11        15      1
Then this:
dups <- duplicated(t(apply(df[, c(1, 2)], 1, sort)))
cbind(df[, c(1, 3)], df[match(df$UserID, df$PartnerID), c(1, 3, 4)])[dups, ]
which will give you your desired output:
#  UserID Happiness UserID Happiness Result
#       3        10      2        20      0
#       4        15      1        30      1
#      13        25     10        20      1
#       6        10      5        10      0
#      12        15     11        10      1

R - Counting the number of a specific value in bins

I have a data frame (df) like below:
Value <- c(1, 1, 0, 2, 1, 3, 4, 0, 0, 1, 2, 0, 3, 0, 4, 5, 2, 3, 0, 6)
Sl <- 1:20
df <- data.frame(Sl, Value)
> df
   Sl Value
1   1     1
2   2     1
3   3     0
4   4     2
5   5     1
6   6     3
7   7     4
8   8     0
9   9     0
10 10     1
11 11     2
12 12     0
13 13     3
14 14     0
15 15     4
16 16     5
17 17     2
18 18     3
19 19     0
20 20     6
I would like to create 4 bins out of df and count the occurrences of Value=0 grouped by Sl values in a separate data frame like below:
Bin Count
  1     1
  2     2
  3     2
  4     1
I was trying to use table and cut to create the desired data frame, but it's not clear how I'd specify df$Value and the logic to find the 0s here:
df.4.cut <- as.data.frame(table(cut(df$Sl, breaks=seq(1,20, by=5))))
Using your df
tapply(df$Value, cut(df$Sl, 4), function(x) sum(x==0))
gives
> tapply(df$Value, cut(df$Sl, 4), function(x) sum(x==0))
(0.981,5.75]  (5.75,10.5] (10.5,15.2]   (15.2,20]
           1            2           2           1
In cut you can specify the number of breaks, or the break points themselves if you prefer, and the counting logic lives in the function passed to tapply.
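For example, passing explicit break points keeps the bin edges round; a minimal variation of the same call:
# same tapply logic, with explicit width-5 breaks instead of cut(df$Sl, 4)
tapply(df$Value, cut(df$Sl, breaks = seq(0, 20, 5)), function(x) sum(x == 0))
#  (0,5]  (5,10] (10,15] (15,20]
#      1       2       2       1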
Or using data.table: we convert the 'data.frame' to a 'data.table' (setDT(df)) and, using the cut output as the grouping variable, take the sum of 'Value' elements that are 0 (!Value). Negating with ! converts the column to a logical vector, i.e. TRUE for 0 and FALSE for all other values.
library(data.table)
setDT(df)[, sum(!Value), .(gr = cut(Sl, breaks = seq(0, 20, 5)))]
#         gr V1
#1:   (0,5]  1
#2:  (5,10]  2
#3: (10,15]  2
#4: (15,20]  1
Your question used table(), but with only one argument; a second one is needed to produce a contingency table of bins against values. You can find the count of each value in each bin with:
table(cut(df$Sl,4),df$Value)
               0 1 2 3 4 5 6
  (0.981,5.75] 1 3 1 0 0 0 0
  (5.75,10.5]  2 1 0 1 1 0 0
  (10.5,15.2]  2 0 1 1 1 0 0
  (15.2,20]    1 0 1 1 0 1 1
And the count of Value == 0 for each bin:
table(cut(df$Sl,4),df$Value)[,"0"]
(0.981,5.75]  (5.75,10.5] (10.5,15.2]   (15.2,20]
           1            2           2           1
A more convoluted way using sqldf:
First we create a table defining the bins and ranges (min and max):
bins <- data.frame(id = c(1, 2, 3, 4),
                   bins = c("(0,5]", "(5,10]", "(10,15]", "(15,20]"),
                   min = c(0, 6, 11, 16),
                   max = c(5, 10, 15, 20))
  id    bins min max
1  1   (0,5]   0   5
2  2  (5,10]   6  10
3  3 (10,15]  11  15
4  4 (15,20]  16  20
Then we use the following query over both tables to bin each Sl into its respective group, using BETWEEN and keeping only the rows where Value equals 0.
library(sqldf)
sqldf("SELECT bins, COUNT(Value) AS freq FROM df, bins
WHERE (((sl) BETWEEN [min] AND [max]) AND Value = 0)
GROUP BY bins
ORDER BY id")
Output:
     bins freq
1   (0,5]    1
2  (5,10]    2
3 (10,15]    2
4 (15,20]    1
Another alternative, suggested by mts, simplifies the construction of the bins table by using cut and extracting the levels of the resulting factor:
bins <- data.frame(id = 1:4,
                   bins = levels(cut(Sl, breaks = seq(0, 20, 5))),
                   min = seq(1, 20, 5),
                   max = seq(5, 20, 5))
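For completeness, the same zero counts can be written as a short dplyr pipeline; a minimal sketch, assuming the same width-5 breaks:
library(dplyr)
df %>%
  group_by(Bin = cut(Sl, breaks = seq(0, 20, 5), labels = FALSE)) %>%
  summarise(Count = sum(Value == 0))
# gives Bin = 1:4 with Count = c(1, 2, 2, 1)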

Select first observed data and utilize mutate

I am running into an issue with my data where I want to take the first observed score for each individual id and subtract the last observed score from it.
The problem with asking for the first observation minus the last observation is that sometimes the first observation data is missing.
Is there anyway to ask for the first observed score for each individual, thus skipping any missing data?
I built the below df to illustrate my problem.
help <- data.frame(id = c(5, 5, 5, 5, 5, 12, 12, 12, 17, 17, 20, 20, 20),
                   ob = c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 1, 2, 3),
                   score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))
   id ob score
1   5  1    NA
2   5  2     2
3   5  3     3
4   5  4     4
5   5  5     3
6  12  1     7
7  12  2     3
8  12  3     4
9  17  1     3
10 17  2     4
11 20  1    NA
12 20  2     1
13 20  3     4
And what I am hoping to run is code that will give me...
   id ob score es
1   5  1    NA -1
2   5  2     2 -1
3   5  3     3 -1
4   5  4     4 -1
5   5  5     3 -1
6  12  1     7  3
7  12  2     3  3
8  12  3     4  3
9  17  1     3 -1
10 17  2     4 -1
11 20  1    NA -3
12 20  2     1 -3
13 20  3     4 -3
I am attempting to work in dplyr, and I understand the use of the group_by command; however, I'm not sure how to select only the first observed score and then mutate to create es.
I would use first() and last() (both dplyr functions) and na.omit() (from the default stats package).
First, I would make sure your score column is a numeric column with proper NA values (not strings):
help <- data.frame(id = c(5, 5, 5, 5, 5, 12, 12, 12, 17, 17, 20, 20, 20),
                   ob = c(1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 1, 2, 3),
                   score = c(NA, 2, 3, 4, 3, 7, 3, 4, 3, 4, NA, 1, 4))
then you can do
library(dplyr)
help %>%
  group_by(id) %>%
  arrange(ob) %>%
  mutate(es = first(na.omit(score)) - last(na.omit(score)))
Another option: compute es per id after dropping the NAs, then join it back onto the original rows.
library(dplyr)
temp <- help %>%
  group_by(id) %>%
  arrange(ob) %>%
  filter(!is.na(score)) %>%
  mutate(es = first(score) - last(score)) %>%
  select(id, es) %>%
  distinct()
help %>% left_join(temp)
This solution is a little verbose, only because it relies on a couple of helper functions, FIRST and LAST:
# The position (index) of the last value that evaluates to TRUE.
LAST <- function(x, none = NA) {
  out <- FIRST(rev(x), none = none)
  if (identical(none, out)) {
    return(none)
  } else {
    return(length(x) - out + 1)
  }
}
# The position (index) of the first value that evaluates to TRUE.
FIRST <- function(x, none = NA) {
  x[is.na(x)] <- FALSE
  if (any(x))
    return(which.max(x))
  else
    return(none)
}
# returns the difference between the first and last non-missing values
diff2 <- function(x)
  x[FIRST(!is.na(x))] - x[LAST(!is.na(x))]
library(dplyr)
help %>%
  group_by(id) %>%
  arrange(ob) %>%
  summarise(diff = diff2(score))
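The same per-group broadcast can also be done in base R with ave(); a minimal sketch, assuming rows are already ordered by ob within each id, as in the example:
# drop the NAs within each id, then subtract the last remaining score from the first;
# ave() recycles the scalar result across every row of the group
help$es <- ave(help$score, help$id, FUN = function(x) {
  x <- x[!is.na(x)]
  x[1] - x[length(x)]
})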
