I have a problem in R that I can't solve, so I'm asking for help here. I want to combine three columns into one by summing them, but haven't found a way to do so. Let's say the data looks like this table:
Time H C W K
0 1 2 0 5
1 5 2 1 1
2 0 1 2 2
How do I turn it into this table:
Time G K
0 3 5
1 8 1
2 3 2
Maybe you can try the code below
subset(within(df, G <- rowSums(cbind(H, C, W))), select = -c(H, C, W))
giving
Time K G
1 0 5 3
2 1 1 8
3 2 2 3
or a data.table option
library(data.table)
setDT(df)[, .(Time, G = rowSums(cbind(H, C, W)), K)][]
Time G K
1: 0 3 5
2: 1 8 1
3: 2 3 2
We can use transmute
library(dplyr)
df %>%
transmute(Time, G = rowSums(select(., H:W)), K)
# Time G K
#1 0 3 5
#2 1 8 1
#3 2 3 2
Maybe try this:
#Code
newdf <- data.frame(df[, 1, drop = FALSE], G = rowSums(df[, -c(1, 5)]), df[, 5, drop = FALSE])
Output:
Time G K
1 0 3 5
2 1 8 1
3 2 3 2
Some data used:
#Data
df <- structure(list(Time = 0:2, H = c(1L, 5L, 0L), C = c(2L, 2L, 1L
), W = 0:2, K = c(5L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
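The positional indexing above (df[, -c(1, 5)]) depends on the column order. As a minimal base R sketch, assuming the columns to sum are known by name, the same result can be built like this (sum_cols is a name introduced here for illustration, not part of the original answer):

```r
# Example data from the question
df <- data.frame(Time = 0:2, H = c(1L, 5L, 0L), C = c(2L, 2L, 1L),
                 W = 0:2, K = c(5L, 1L, 2L))

# Hypothetical helper vector: the columns to collapse into G
sum_cols <- c("H", "C", "W")

# Sum the chosen columns row by row, then keep Time, G and K in that order
newdf <- data.frame(Time = df$Time,
                    G = unname(rowSums(df[sum_cols])),
                    K = df$K)
newdf
```

If any of the summed columns can contain NA, rowSums(df[sum_cols], na.rm = TRUE) drops them from the sum instead of returning NA for that row.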
As a shortcut that avoids listing each variable, and building on @KarthikS's answer, you can use c_across():
library(dplyr)
#Code2
newdf <- df %>% rowwise() %>% mutate(G = sum(c_across(H:W))) %>% select(Time, G, K)
Output:
# A tibble: 3 x 3
# Rowwise:
Time G K
<int> <int> <int>
1 0 3 5
2 1 8 1
3 2 3 2
I have a dataset that looks like the following:
group y x
1 2 0
1 3 0
1 1 0
2 3 1
2 4 1
2 3 1
In the actual dataset, there are 180 groups (though they're not numbered from 1-180). The value of x is either 0 or 1 and is the same within each group. The value of y differs for each individual observation.
I am trying to get a random sample with replacement from the group column. Then, I would like to find a way to combine this with the original data. For example, if I randomly sample the group 1, I would like the final dataset to include all 3 observations included in group 1. If I randomly sample group 1 twice, I would like the final dataset to include each observation from group 1 twice.
Here's an example. Suppose I have randomly sampled groups 1, 1, and 2; I would like the final dataset to look like this:
group y x
1 2 0
1 3 0
1 1 0
1 2 0
1 3 0
1 1 0
2 3 1
2 4 1
2 3 1
When I sample like below, I get a list of values. I am not sure what to do next to get the results I am looking for.
clusters <- sample(df$group, 180, replace = TRUE)
In Excel, I would use vlookup() to do something like this.
Base R:
set.seed(42)
do.call(rbind, sample(split(dat, dat$group), size = 3, replace = TRUE))
# group y x
# 2.4 2 3 1
# 2.5 2 4 1
# 2.6 2 3 1
# 2.41 2 3 1
# 2.51 2 4 1
# 2.61 2 3 1
# 1.1 1 2 0
# 1.2 1 3 0
# 1.3 1 1 0
(The row names are not pretty, but they are harmless and ignored by most tools.)
Generically, and piece-wise, we see:
dat_spl <- split(dat, dat$group)
inds <- c(1, 1, 2)
### randomly this can be done with:
# inds <- sample(length(dat_spl), size = 3, replace = TRUE)
do.call(rbind, dat_spl[inds])
# group y x
# 1.1 1 2 0
# 1.2 1 3 0
# 1.3 1 1 0
# 1.11 1 2 0
# 1.21 1 3 0
# 1.31 1 1 0
# 2.4 2 3 1
# 2.5 2 4 1
# 2.6 2 3 1
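One thing to watch in the question's sample(df$group, ...): it draws from the row-level column, so in general a group with more rows is drawn more often. A minimal sketch, assuming every group should be equally likely, samples from the unique group IDs and then looks up all rows for each draw (groups, ids and boot are names introduced here for illustration):

```r
dat <- data.frame(group = c(1L, 1L, 1L, 2L, 2L, 2L),
                  y = c(2L, 3L, 1L, 3L, 4L, 3L),
                  x = c(0L, 0L, 0L, 1L, 1L, 1L))

set.seed(42)
groups <- unique(dat$group)
# Index into the unique IDs rather than calling sample(groups, ...) directly,
# which would misbehave if only one numeric group were present
ids <- groups[sample.int(length(groups), size = 3, replace = TRUE)]

# For each drawn ID, pull every row of that group; repeated draws are kept
boot <- do.call(rbind, lapply(ids, function(g) dat[dat$group == g, ]))
rownames(boot) <- NULL
boot
```

This is the closest analogue to the vlookup() idea: each sampled ID is looked up in the original data, once per draw.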
If you want/need it to be pure-tidyverse, an alternative:
library(dplyr)
library(tidyr) # for nest() and unnest()
set.seed(42)
dat %>%
group_by(group) %>%
nest(dat = -group) %>%
ungroup() %>%
sample_n(3, replace = TRUE) %>%
unnest(dat)
# # A tibble: 9 x 3
# group y x
# <int> <int> <int>
# 1 2 3 1
# 2 2 4 1
# 3 2 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 3 1
# 7 1 2 0
# 8 1 3 0
# 9 1 1 0
Data:
dat <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L), y = c(2L, 3L,
1L, 3L, 4L, 3L), x = c(0L, 0L, 0L, 1L, 1L, 1L)), row.names = c(NA,
-6L), class = "data.frame")
I have a dataframe with two rows of data. I would like to add a third row with values representing the difference between the previous two.
The data is in the following format:
Month A B C D E F
Jan 1 2 4 8 4 1
Feb 1 1 4 5 2 0
I would like to add an extra row with a calculation to give the change across the two months:
Month A B C D E F
Jan 1 2 4 8 4 1
Feb 1 1 4 5 2 0
Change 0 1 0 3 2 1
I've been looking at various functions to add additional rows including rbind and mutate but I'm struggling to perform the calculation in the newly created row.
As it's just two rows you can subtract the individual rows and rbind the difference
rbind(df, data.frame(Month = "Change", df[1, -1] - df[2, -1]))
# Month A B C D E F
#1 Jan 1 2 4 8 4 1
#2 Feb 1 1 4 5 2 0
#3 Change 0 1 0 3 2 1
d1 <- data.frame(Month = "Change", df[1, -1] - df[2, -1])
newdf <- rbind(df, d1)
This creates a new data frame with the extra row you need.
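Note that the expected output in the question contains no negative numbers. As a sketch, assuming the absolute change is what is wanted regardless of direction, abs() can be wrapped around the difference (num, change and out are illustrative names):

```r
df <- data.frame(Month = c("Jan", "Feb"), A = c(1L, 1L), B = 2:1,
                 C = c(4L, 4L), D = c(8L, 5L), E = c(4L, 2L), F = 1:0,
                 stringsAsFactors = FALSE)

# Pick the numeric columns by type instead of by position
num <- vapply(df, is.numeric, logical(1))

# Absolute difference between the two rows
change <- abs(df[1, num] - df[2, num])

# Append the difference as a third row labelled "Change"
out <- rbind(df, data.frame(Month = "Change", change, stringsAsFactors = FALSE))
rownames(out) <- NULL
out
```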
An option with tidyverse
library(tidyverse)
df1 %>%
summarise_if(is.numeric, diff) %>%
abs %>%
bind_rows(df1, .) %>%
mutate(Month = replace_na(Month, "Change"))
# Month A B C D E F
#1 Jan 1 2 4 8 4 1
#2 Feb 1 1 4 5 2 0
#3 Change 0 1 0 3 2 1
data
df1 <- structure(list(Month = c("Jan", "Feb"), A = c(1L, 1L), B = 2:1,
C = c(4L, 4L), D = c(8L, 5L), E = c(4L, 2L), F = 1:0),
class = "data.frame", row.names = c(NA,
-2L))
I have a dataframe in R that looks like the following:
a b c condition
1 4 2 acap
2 3 1 acap
2 4 3 acap
5 6 8 ncap
5 7 6 ncap
8 7 6 ncap
I am trying to recode the values in columns a, b, and c for condition ncap (and also 2 other conditions not pictured here) while leaving the values for acap alone.
The following code works when applied to the first 3 columns. I am trying to figure out how I can apply this only to rows that I specify by condition while keeping everything in the same dataframe.
df = df %>%
mutate_at(vars(a:c), function(x)
case_when(x == 5 ~ 1, x == 6 ~ 2, x == 7 ~ 3, x == 8 ~ 4))
This is the expected output.
a b c condition
1 4 2 acap
2 3 1 acap
2 4 3 acap
1 2 4 ncap
1 3 2 ncap
4 3 2 ncap
I've looked around for an answer to this question and am unable to find it. If someone knows of an answer that already exists, I would appreciate being directed to it.
We can use case_when on a condition built with row_number(): if the row number is 4 to 6, subtract 4 from the value; otherwise return the value unchanged.
df %>%
mutate_at(vars(a:c), funs(case_when(row_number() %in% 4:6 ~ . - 4L,
TRUE ~ .)))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
If this is based on the value instead of the rows, create the condition on the value
df %>%
mutate_at(vars(a:c), funs(case_when(. %in% 5:8 ~ . - 4L,
TRUE ~ .)))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
Or if it is based on the value in the 'condition'
df %>%
mutate_at(vars(a:c), funs(case_when(condition == 'ncap' ~ . - 4L,
TRUE ~ .)))
Or without using any case_when
df %>%
mutate_at(vars(a:c), funs( . - c(0, 4)[(condition == 'ncap')+1]))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
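funs() has been deprecated since dplyr 0.8.0. As a sketch, assuming dplyr 1.0.0 or later is installed, the same condition-based recode can be written with across() (res is a name introduced here):

```r
library(dplyr)

df <- data.frame(a = c(1L, 2L, 2L, 5L, 5L, 8L),
                 b = c(4L, 3L, 4L, 6L, 7L, 7L),
                 c = c(2L, 1L, 3L, 8L, 6L, 6L),
                 condition = rep(c("acap", "ncap"), each = 3),
                 stringsAsFactors = FALSE)

# Subtract 4 only where condition is "ncap"; other rows pass through unchanged
res <- df %>%
  mutate(across(a:c, ~ if_else(condition == "ncap", .x - 4L, .x)))
res
```

To cover the other conditions mentioned in the question, the test could become condition %in% c("ncap", ...) with the remaining condition names filled in.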
In base R, we can do this by creating the index
i1 <- df$condition =='ncap'
df[i1, 1:3] <- df[i1, 1:3] - 4
data
df <- structure(list(a = c(1L, 2L, 2L, 5L, 5L, 8L), b = c(4L, 3L, 4L,
6L, 7L, 7L), c = c(2L, 1L, 3L, 8L, 6L, 6L), condition = c("acap",
"acap", "acap", "ncap", "ncap", "ncap")), class = "data.frame",
row.names = c(NA, -6L))
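Subtracting 4 works because the mapping in the question happens to be a constant shift. For an arbitrary recode, a named lookup vector is one base R option. A minimal sketch (map, rows and cols are names introduced here; any value missing from map becomes NA):

```r
df <- data.frame(a = c(1L, 2L, 2L, 5L, 5L, 8L),
                 b = c(4L, 3L, 4L, 6L, 7L, 7L),
                 c = c(2L, 1L, 3L, 8L, 6L, 6L),
                 condition = rep(c("acap", "ncap"), each = 3),
                 stringsAsFactors = FALSE)

# Old value (as a name) -> new value
map <- c("5" = 1L, "6" = 2L, "7" = 3L, "8" = 4L)

rows <- df$condition == "ncap"   # rows to recode
cols <- c("a", "b", "c")         # columns to recode

# Look each value up by name, column by column, and write the result back
df[rows, cols] <- lapply(df[rows, cols],
                         function(x) unname(map[as.character(x)]))
df
```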
You can use filter to apply recoding values to only specific rows (not equal to "acap" here)
library(dplyr)
df %>%
filter(condition != "acap") %>%
mutate_at(vars(a:c), function(x)
case_when(x == 5 ~ 1, x == 6 ~ 2, x == 7 ~ 3, x == 8 ~ 4))
# a b c condition
#1 1 2 4 ncap
#2 1 3 2 ncap
#3 4 3 2 ncap
If you need the entire dataframe back again we can do
df %>%
filter(condition == "acap") %>%
bind_rows(df %>%
filter(condition != "acap") %>%
mutate_at(vars(a:c), function(x)
case_when(x == 5 ~ 1, x == 6 ~ 2, x == 7 ~ 3, x == 8 ~ 4)))
# a b c condition
#1 1 4 2 acap
#2 2 3 1 acap
#3 2 4 3 acap
#4 1 2 4 ncap
#5 1 3 2 ncap
#6 4 3 2 ncap
I have a strange dataset format where a simple reshape function won't work. Assume I have three time periods (1-3), two ID names (A-B), and three variables (X, Y and Z) in the following format, where the ID names and variable names are separated by -:
Time A-X A-Y A-Z B-X B-Y B-Z
1 2 4 5 6 1 2
2 2 3 2 3 2 3
3 4 4 4 4 4 4
Ideally, I would like to produce the dataset in the following format:
ID Time X Y Z
A 1 2 4 5
A 2 2 3 2
A 3 4 4 4
B 1 6 1 2
B 2 3 2 3
B 3 4 4 4
Which functions should I use?
library(dplyr)
library(tidyr)
library(splitstackshape)
df %>%
gather(key, value, -Time) %>%
cSplit("key", sep="_") %>%
spread(key_2, value) %>%
rename(ID = key_1) %>%
arrange(ID, Time)
Output is:
Time ID X Y Z
1 1 A 2 4 5
2 2 A 2 3 2
3 3 A 4 4 4
4 1 B 6 1 2
5 2 B 3 2 3
6 3 B 4 4 4
Sample data:
df <- structure(list(Time = 1:3, A_X = c(2L, 2L, 4L), A_Y = c(4L, 3L,
4L), A_Z = c(5L, 2L, 4L), B_X = c(6L, 3L, 4L), B_Y = c(1L, 2L,
4L), B_Z = 2:4), .Names = c("Time", "A_X", "A_Y", "A_Z", "B_X",
"B_Y", "B_Z"), class = "data.frame", row.names = c(NA, -3L))
Here is another dplyr and tidyr solution.
df %>%
gather(ID, value, -Time) %>%
separate(ID, into = c("ID", "var")) %>%
spread(var, value) %>%
arrange(ID) %>%
select(ID, Time, X, Y, Z)
# ID Time X Y Z
# 1 A 1 2 4 5
# 2 A 2 2 3 2
# 3 A 3 4 4 4
# 4 B 1 6 1 2
# 5 B 2 3 2 3
# 6 B 3 4 4 4
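If you would rather avoid extra packages, here is a base R sketch of the same reshape. It assumes the column names follow the ID_variable pattern of the sample data (underscore separator); value_cols, ids, block and long are names introduced here for illustration:

```r
df <- data.frame(Time = 1:3,
                 A_X = c(2L, 2L, 4L), A_Y = c(4L, 3L, 4L), A_Z = c(5L, 2L, 4L),
                 B_X = c(6L, 3L, 4L), B_Y = c(1L, 2L, 4L), B_Z = 2:4)

# Every column except Time carries an "ID_variable" name
value_cols <- setdiff(names(df), "Time")
ids <- unique(sub("_.*$", "", value_cols))

# Build one block per ID, strip the ID prefix, then stack the blocks
long <- do.call(rbind, lapply(ids, function(id) {
  block <- df[startsWith(names(df), paste0(id, "_"))]
  names(block) <- sub("^[^_]*_", "", names(block))
  data.frame(ID = id, Time = df$Time, block, stringsAsFactors = FALSE)
}))
rownames(long) <- NULL
long
```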
I am currently trying to count the frequency of a set of row combinations in a data frame.
A B
1 a
1 b
1 c
2 a
2 b
2 c
I have this data frame, and I would like to count how often each of its rows appears in another data frame that looks like this:
C D
1 a
1 a
1 b
1 b
2 b
2 c
2 c
As you can see, the number of rows is different, so datatable(counts) does not work. I would like it to look like this after the frequency count is done:
A B freq
1 a 2
1 b 2
1 c 0
2 a 0
2 b 1
2 c 2
As you can see, it counts every combination, including those with a frequency of 0, since some groups have no data.
Thanks to anyone who helps!
By using merge and aggregate
df2$freq = 1
df = merge(df1, aggregate(freq ~ ., df2, length), by.x = c('A', 'B'), by.y = c('C', 'D'), all.x = TRUE)
df[is.na(df)] = 0
df
A B freq
1 1 a 2
2 1 b 2
3 1 c 0
4 2 a 0
5 2 b 1
6 2 c 2
More Info
aggregate(freq~.,df2,length)
C D freq
1 1 a 2
2 1 b 2
3 2 b 1
4 2 c 2
Data Input
df1
A B
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
df2
C D
1 1 a
2 1 a
3 1 b
4 1 b
5 2 b
6 2 c
7 2 c
This looks to be a question of how to tabulate frequencies across two factors without dropping missing levels.
Here's a dplyr solution. It assumes that dfAB (the data frame with columns A and B; dfCD is the one with columns C and D), as in your example data, contains no duplicates. dfAB is interchangeable with the output of expand.grid if you don't already have the level combinations in a data frame.
library(dplyr)
dfAB %>%
# need at least one non-joining variable to tell matches from non-matches
left_join(mutate(dfCD, dummy = 1), by = c("A" = "C", "B" = "D")) %>%
group_by(A, B) %>%
summarize(freq = sum(dummy, na.rm = TRUE))
Output:
# A tibble: 6 x 3
# Groups: A [?]
A B freq
<dbl> <chr> <dbl>
1 1 a 2
2 1 b 2
3 1 c 0
4 2 a 0
5 2 b 1
6 2 c 2
(if there are duplicates in dfAB, add a distinct call to the chain before the join)
df1_rows = Reduce(paste, df1)
df2_rows = Reduce(paste, df2)
data.frame(df1, freq = sapply(df1_rows, function(x) sum(df2_rows %in% x)),
row.names = NULL)
# A B freq
#1 1 a 2
#2 1 b 2
#3 1 c 0
#4 2 a 0
#5 2 b 1
#6 2 c 2
DATA
df1 = data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
B = c("a", "b", "c", "a", "b", "c"))
df2 = data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
D = c("a", "a", "b", "b", "b", "c", "c"))
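Another base R route to the zero counts is a factor whose levels are the combinations in df1, since table() reports every level even when it never occurs. A minimal sketch, assuming the rows of df1 are unique and that "_" does not appear inside the values (key1 and key2 are names introduced here):

```r
df1 <- data.frame(A = c(1L, 1L, 1L, 2L, 2L, 2L),
                  B = c("a", "b", "c", "a", "b", "c"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(C = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
                  D = c("a", "a", "b", "b", "b", "c", "c"),
                  stringsAsFactors = FALSE)

# One key per row; df1's keys define the full set of levels to count
key1 <- paste(df1$A, df1$B, sep = "_")
key2 <- factor(paste(df2$C, df2$D, sep = "_"), levels = key1)

# table() keeps unused levels, so missing combinations get a count of 0
df1$freq <- as.vector(table(key2))
df1
```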