Is it possible to bind rows only where specific values are missing?
In this example I have a table with four IDs and some values. Every ID is supposed to have a corresponding value of 1-3. As you can see, some of these values are missing in table dat. To fix this I want to bind dat with dat2, but only where values from the column "value" are missing from table dat. How can I achieve this?
To be clear, I only want 12 rows in total. So for instance, ID 4 has the value 3 and cat_var "green" in table dat. By contrast, in table dat2 ID 4 has the value 3 and cat_var "red". This means that I don't want to bind that row, since there already exists a row for ID 4 and value 3 in table dat. I hope I'm making myself clear.
library(tidyverse)
Data:
id <- c(rep(1:4,3))
value <- c(rep(1:3, each = 4))
dat <- data.frame(id, value)
dat2 <- dat
dat <- dat %>%
slice(1, 3, 5, 6, 7, 8, 10, 12)
dat2$cat_var <- c(rep("orange", 5), rep("green", 5), rep("red", 2))
dat$cat_var <- c(rep("orange", 3), rep("green", 5))
Desired result:
# A tibble: 12 x 3
id value cat_var
<int> <int> <chr>
1 1 1 orange
2 2 1 orange
3 3 1 orange
4 4 1 orange
5 1 2 orange
6 2 2 green
7 3 2 green
8 4 2 green
9 1 3 green
10 2 3 green
11 3 3 red
12 4 3 green
dat %>%
  bind_rows(dat2) %>%
  distinct(id, value, .keep_all = TRUE) %>%
  arrange(value, id)
This results in:
id value cat_var
1 1 1 orange
2 2 1 orange
3 3 1 orange
4 4 1 orange
5 1 2 orange
6 2 2 green
7 3 2 green
8 4 2 green
9 1 3 green
10 2 3 green
11 3 3 red
12 4 3 green
You don't need the arrange (it is only there to get exactly the same data frame as the desired result).
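Another way to spell out the "only add what is missing" rule is an anti-join. The following is a minimal sketch, assuming dplyr is available; `missing_rows` and `result` are names introduced here for illustration only. `anti_join(dat2, dat, by = c("id", "value"))` keeps just the rows of dat2 whose id/value pair does not occur in dat, so no existing row of dat can be replaced.

```r
library(dplyr)

# Rebuild the example data from the question
id <- rep(1:4, 3)
value <- rep(1:3, each = 4)
dat <- data.frame(id, value)
dat2 <- dat
dat <- dat %>% slice(1, 3, 5, 6, 7, 8, 10, 12)
dat2$cat_var <- c(rep("orange", 5), rep("green", 5), rep("red", 2))
dat$cat_var <- c(rep("orange", 3), rep("green", 5))

# Rows of dat2 whose (id, value) pair is absent from dat
missing_rows <- anti_join(dat2, dat, by = c("id", "value"))

# Append only those rows; arrange() is optional and only fixes the row order
result <- bind_rows(dat, missing_rows) %>%
  arrange(value, id)
```

Compared with `distinct()`, this version does not depend on dat being bound first, which may be easier to reason about when the two tables are combined in several places.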
Using base R:
Row-bind dat and dat2, then use duplicated to keep only the first row for each id/value pair.
result <- rbind(dat, dat2)
result <- result[!duplicated(result[, c('id', 'value')]), ]
result
# id value cat_var
#1 1 1 orange
#2 3 1 orange
#3 1 2 orange
#4 2 2 green
#5 3 2 green
#6 4 2 green
#7 2 3 green
#8 4 3 green
#10 2 1 orange
#12 4 1 orange
#17 1 3 green
#19 3 3 red
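The base R result above keeps the row order and row names produced by rbind. As a minimal sketch (the sorting step is an addition here, not part of the original answer), the rows can be ordered by value and id to match the desired result. Note that dat's rows win only because dat is the first argument to rbind: duplicated() flags the second and later occurrences of each id/value pair.

```r
# Rebuild the example data from the question (base R only)
id <- rep(1:4, 3)
value <- rep(1:3, each = 4)
dat2 <- data.frame(id, value)
dat <- dat2[c(1, 3, 5, 6, 7, 8, 10, 12), ]
dat2$cat_var <- c(rep("orange", 5), rep("green", 5), rep("red", 2))
dat$cat_var <- c(rep("orange", 3), rep("green", 5))

# dat comes first, so its rows are the ones kept for duplicated pairs
result <- rbind(dat, dat2)
result <- result[!duplicated(result[, c("id", "value")]), ]

# Optional: sort by value, then id, and reset the row names
result <- result[order(result$value, result$id), ]
rownames(result) <- NULL
```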
Related
I basically have a data frame with a column of letters and a column of colors:
x <- data.frame(col1=c("a","b","a","c","d","d","c","a","b","c"),
col2=c("red","orange","yellow","red","red","yellow","orange","yellow","red","orange"))
col1 col2
a red
b orange
a yellow
c red
d red
d yellow
c orange
a yellow
b red
c orange
My goal is to create a second data frame that counts the number of occurrences of each color in col2 of x for each letter in col1. Basically:
Letters Occurences Red Orange Yellow
a 3 1 0 2
b 2 1 1 0
c 3 1 2 0
d 2 1 0 1
Right now, I just brute forced it since there are only 3 factors of col2. I used:
df <- data.frame(Letters = levels(factor(x$col1)))
df$Occurences <- table(x$col1)
df$red <- table(factor(x$col1[x$col2=="red"],levels=levels(factor(x$col1))))
df$orange <- table(factor(x$col1[x$col2=="orange"],levels=levels(factor(x$col1))))
df$yellow <- table(factor(x$col1[x$col2=="yellow"],levels=levels(factor(x$col1))))
Is there an easier way to do this, as opposed to doing each column of df one by one? Especially with a data set that has a lot more than 3 factors?
Use pivot_wider from tidyr
library(tidyr)
x %>%
pivot_wider(names_from = col2, values_from = col2, values_fn = "length", values_fill = 0)
Output:
# A tibble: 4 × 4
col1 red orange yellow
<chr> <int> <int> <int>
1 a 1 0 2
2 b 1 1 0
3 c 1 2 0
4 d 1 0 1
as.data.frame.matrix(addmargins(table(x), 2))
orange red yellow Sum
a 0 1 2 3
b 1 1 0 2
c 2 1 0 3
d 0 1 1 2
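To get something closer to the layout in the question, with a Letters column and the total next to the per-color counts, the cross-tabulation can be assembled into a data frame by hand. This is a minimal base R sketch; `counts` and `res` are names introduced here, and the column name Occurrences is spelled differently from the question's `df$Occurences`.

```r
x <- data.frame(
  col1 = c("a", "b", "a", "c", "d", "d", "c", "a", "b", "c"),
  col2 = c("red", "orange", "yellow", "red", "red",
           "yellow", "orange", "yellow", "red", "orange")
)

# Cross-tabulate letters (rows) against colors (columns)
counts <- table(x$col1, x$col2)

# One row per letter: the row total followed by one column per color
res <- data.frame(
  Letters = rownames(counts),
  Occurrences = as.vector(rowSums(counts)),
  as.data.frame.matrix(counts),
  row.names = NULL
)
```

Because the color columns come from table(), this should carry over to any number of distinct values in col2 without listing them one by one.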
I have a list of data frames:
df1 <- data.frame(one = c('red','blue','green','red','red','blue','green','green'),
one.1 = as.numeric(c('1','1','0','1','1','0','0','0')))
df2 <- data.frame(two = c('red','yellow','green','yellow','green','blue','blue','red'),
two.2 = as.numeric(c('0','1','1','0','0','0','1','1')))
df3 <- data.frame(three = c('yellow','yellow','green','green','green','white','blue','white'),
three.3 = as.numeric(c('1','0','0','1','1','0','0','1')))
all <- list(df1,df2,df3)
I need to group each data frame by the first column and summarise the second column.
Individually I would do something like this:
library(dplyr)
df1 <- df1 %>%
group_by(one) %>%
summarise(sum = sum(one.1))
However I'm having trouble figuring out how to iterate over each item in the list.
I've thought of using a loop:
for(i in 1:3){
all[i] <- all[i] %>%
group_by_at(1) %>%
summarise()
}
But I can't figure out how to specify a column to sum in the summarise() function (this loop is likely wrong in other ways than that anyway).
Ideally I need the output to be another list with each item being the summarised data, like so:
[[1]]
# A tibble: 3 x 2
one sum
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two sum
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three sum
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
Would really appreciate any help!
Use purrr::map, and apply summarise_at to the columns whose names contain a literal dot (\\.), selected with the matches helper.
library(dplyr)
library(purrr)
map(all, ~.x %>%
#group_by_at(vars(matches('one$|two$|three$'))) %>% #column ends with one, two, or three
group_by_at(1) %>%
summarise_at(vars(matches('\\.')),sum))
#summarise_at(vars(matches('\\.')),list(sum=~sum))) #2nd option
[[1]]
# A tibble: 3 x 2
one one.1
<fct> <dbl>
1 blue 1
2 green 0
3 red 3
[[2]]
# A tibble: 4 x 2
two two.2
<fct> <dbl>
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
# A tibble: 4 x 2
three three.3
<fct> <dbl>
1 blue 0
2 green 2
3 white 1
4 yellow 1
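For comparison, here is a minimal sketch of how the loop from the question could be repaired, assuming dplyr 1.0 or later (where `across()` is available and `group_by_at()` is superseded). Two things change: `all[[i]]` extracts the data frame itself, whereas `all[i]` returns a one-element list, and the column to sum is looked up by position and passed through the `.data` pronoun. `out` and `value_col` are names introduced here.

```r
library(dplyr)

df1 <- data.frame(one = c("red", "blue", "green", "red", "red", "blue", "green", "green"),
                  one.1 = c(1, 1, 0, 1, 1, 0, 0, 0))
df2 <- data.frame(two = c("red", "yellow", "green", "yellow", "green", "blue", "blue", "red"),
                  two.2 = c(0, 1, 1, 0, 0, 0, 1, 1))
df3 <- data.frame(three = c("yellow", "yellow", "green", "green", "green", "white", "blue", "white"),
                  three.3 = c(1, 0, 0, 1, 1, 0, 0, 1))
all <- list(df1, df2, df3)

out <- vector("list", length(all))
for (i in seq_along(all)) {
  # Name of the second column, which holds the values to sum
  value_col <- names(all[[i]])[2]
  out[[i]] <- all[[i]] %>%
    group_by(across(1)) %>%                       # group by the first column
    summarise(sum = sum(.data[[value_col]]), .groups = "drop")
}
```

Unlike the summarise_at version, this names the result column sum in every list element, as in the desired output.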
Here's a base R solution:
lapply(all, function(DF) aggregate(list(added = DF[, 2]), by = DF[, 1, drop = F], FUN = sum))
[[1]]
one added
1 blue 1
2 green 0
3 red 3
[[2]]
two added
1 blue 1
2 green 1
3 red 1
4 yellow 1
[[3]]
three added
1 blue 0
2 green 2
3 white 1
4 yellow 1
Another approach would be to bind the data frames in the list into one. Here I use data.table and ignore the column names when binding. The only problem is that this may mess up factors, but I'm not sure that's an issue in your case.
library(data.table)
rbindlist(all, use.names = F, idcol = 'id'
)[, .(added = sum(one.1)), by = .(id, color = one)]
id color added
1: 1 red 3
2: 1 blue 1
3: 1 green 0
4: 2 red 1
5: 2 yellow 1
6: 2 green 1
7: 2 blue 1
8: 3 yellow 1
9: 3 green 2
10: 3 white 1
11: 3 blue 0
This is the short example data. The original data has many columns and rows.
head(df, 15)
ID col1 col2
1 1 green yellow
2 1 green blue
3 1 green green
4 2 yellow blue
5 2 yellow yellow
6 2 yellow blue
7 3 yellow yellow
8 3 yellow yellow
9 3 yellow blue
10 4 blue yellow
11 4 blue yellow
12 4 blue yellow
13 5 yellow yellow
14 5 yellow blue
15 5 yellow yellow
What I want is to count how many different colors appear in col2 for each ID, including the color in col1. For example, for ID = 4 there is only 1 color in col2; if we include col1, there are 2 different colors, so the output should be 2, and so on.
I tried it this way, but it doesn't give my desired output: ID = 4 turns into 0, which is not what I want. So how can I tell R to count the colors including the one in col1?
out <- df %>%
group_by(ID) %>%
mutate(N = ifelse(col1 != col2, 1, 0))
My desired output is something like this:
ID col1 count
1 green 3
2 yellow 2
3 yellow 2
4 blue 2
5 yellow 2
You can do:
df %>%
group_by(ID, col1) %>%
summarise(count = n_distinct(col2))
ID col1 count
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 1
5 5 yellow 2
Or even:
df %>%
group_by(ID, col1) %>%
summarise_all(n_distinct)
ID col1 col2
<int> <chr> <int>
1 1 green 3
2 2 yellow 2
3 3 yellow 2
4 4 blue 1
5 5 yellow 2
To group by every three rows:
df %>%
group_by(group = gl(n()/3, 3), col1) %>%
summarise(count = n_distinct(col2))
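The desired output in the question also counts the col1 color, so ID 4 should give 2 rather than 1. A minimal sketch of one way to include it, assuming dplyr is available: concatenate col1 and col2 inside n_distinct(). The data frame below rebuilds the 15 rows shown in the question; `out` is a name introduced here.

```r
library(dplyr)

df <- data.frame(
  ID = rep(1:5, each = 3),
  col1 = rep(c("green", "yellow", "yellow", "blue", "yellow"), each = 3),
  col2 = c("yellow", "blue", "green",
           "blue", "yellow", "blue",
           "yellow", "yellow", "blue",
           "yellow", "yellow", "yellow",
           "yellow", "blue", "yellow")
)

out <- df %>%
  group_by(ID, col1) %>%
  # Count distinct colors across both columns within each group
  summarise(count = n_distinct(c(col1, col2)), .groups = "drop")
```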
I have data with about 1000 groups; each group is ordered from 1 to 100 (it can be any number within 100).
While looking through the data, I found that some groups had bad orders, i.e., the values would climb to 100 and then suddenly a 24 would show up.
How can I delete all of these erroneous rows?
As you can see from the picture above (before -> after), I would like to find all rows that don't follow the order within the group and delete them.
Any help would be great!
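lag(order) returns the previous value within the group, so order - lag(order) is the difference between the current value and the previous one. The filter keeps only rows where that difference is non-negative, i.e. the current value is at least as large as the previous value. The first row of each group has no previous value, so its diff is NA; the order == min(order) condition is there to keep that row. I keep the helper column diff for checking, but you can drop it with %>% select(-diff).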
library(dplyr)
df1 %>% group_by(gruop) %>% mutate(diff = order-lag(order)) %>%
filter(diff >= 0 | order==min(order))
# A tibble: 8 x 3
# Groups: gruop [2]
gruop order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
Data
df1 <- read.table(text="
gruop order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
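One thing to keep in mind: the lag comparison looks only at the immediately preceding row, so a value that follows a bad row can survive even if it is still below an earlier peak (for example 1, 3, 5, 10, 2, 3 would keep the final 3). A minimal sketch of a stricter filter, assuming "following the order" means never dropping below any earlier value in the group, compares each value with the running maximum instead. `cleaned` is a name introduced here.

```r
library(dplyr)

# Same data as above (the column name gruop is kept as in the answer)
df1 <- data.frame(
  gruop = rep(1:2, each = 5),
  order = c(1, 3, 5, 10, 2, 1, 4, 4, 8, 3)
)

cleaned <- df1 %>%
  group_by(gruop) %>%
  # Keep a row only if it equals the running maximum of its group
  filter(order == cummax(order)) %>%
  ungroup()
```

On this example data both approaches return the same eight rows; they differ only when several out-of-order values appear in a row.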
Assuming the order column increments by 1 every time, we can use ave to compute the difference from the previous row within each group and keep only the rows where that difference is 1.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one then you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9
I'm new to R, and I'm pretty sure this is something simple to accomplish, but I cannot figure out how to perform this action. I've tried the split function, utilizing a for loop, but cannot quite figure out how to get it right. As an example, this is what my original data frame looks like:
dat <- data.frame(col1 = c(rep("red", 4), rep("blue", 3)), col2 = c(1, 3, 2, 4, 7, 8, 9))
col1 col2
red 1
red 3
red 2
red 4
blue 7
blue 8
blue 9
I want to create a new column for each unique value in col1 and assign its corresponding values from col2 to the new data frame. This is how I want my new data frame to look:
red blue
1 7
3 8
2 9
4 NA
I've gotten close with a list structure similar to what I wanted, but I need a data frame to boxplot and dotplot the results. Any help would be appreciated. Thanks!
I'm sure there's a more efficient solution, but here's one option
dat <- data.frame(col1 = c(rep("red", 4), rep("blue", 3)), col2 = c(1, 3, 2, 4, 7, 8, 9))
dat
col1 col2
1 red 1
2 red 3
3 red 2
4 red 4
5 blue 7
6 blue 8
7 blue 9
ust <- unstack(dat, form = col2 ~ col1)
res <- data.frame(sapply(ust, '[', 1:max(unlist(lapply(ust, length)))))
res
blue red
1 7 1
2 8 3
3 9 2
4 NA 4
Edit: If you want the column order red then blue
res[, c("red", "blue")]
red blue
1 1 7
2 3 8
3 2 9
4 4 NA
Here's a possible Hadleyverse solution
library(tidyr)
library(dplyr)
dat %>%
group_by(col1) %>%
mutate(n = row_number()) %>%
spread(col1, col2)
# Source: local data frame [4 x 3]
#
# n blue red
# 1 1 7 1
# 2 2 8 3
# 3 3 9 2
# 4 4 NA 4
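In current tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same idea with the newer function, assuming tidyr 1.0 or later; the helper column n plays the same role as above and is dropped at the end, and `wide` is a name introduced here.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(col1 = c(rep("red", 4), rep("blue", 3)),
                  col2 = c(1, 3, 2, 4, 7, 8, 9))

wide <- dat %>%
  group_by(col1) %>%
  mutate(n = row_number()) %>%   # position of each value within its color
  ungroup() %>%
  pivot_wider(names_from = col1, values_from = col2) %>%
  select(-n)
```

pivot_wider() creates the new columns in order of first appearance, so red comes before blue here, and the missing fourth blue value is filled with NA.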
Or using data.table
library(data.table)
dcast(setDT(dat)[, indx := 1:.N, by = col1], indx ~ col1, value.var = "col2")
# indx blue red
# 1: 1 7 1
# 2: 2 8 3
# 3: 3 9 2
# 4: 4 NA 4
Just to show another option using base R *apply and cbind
# split the data into list using col1 column
tmp.list = lapply(split(dat, dat$col1), function(x) x$col2)
# identify the length of the biggest list
max.length = max(sapply(tmp.list, length))
# combine the list elements, while filling NA for the missing values
data.frame(do.call(cbind,
lapply(tmp.list, function(x) c(x, rep(NA, max.length - length(x))))
))
# blue red
#1 7 1
#2 8 3
#3 9 2
#4 NA 4
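The padding step can also be written more compactly. This is a minimal sketch of the same split-and-pad technique, relying on the fact that assigning a longer length to a vector with `length<-` fills the new positions with NA; `cols`, `n` and `res` are names introduced here.

```r
dat <- data.frame(col1 = c(rep("red", 4), rep("blue", 3)),
                  col2 = c(1, 3, 2, 4, 7, 8, 9))

# One numeric vector per value of col1
cols <- split(dat$col2, dat$col1)

# Length of the longest vector
n <- max(lengths(cols))

# Extend every vector to length n (padding with NA) and combine into a data frame
res <- as.data.frame(lapply(cols, `length<-`, n))
```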