Two step dataframe rearrange in R

I import a csv into a dataframe with this structure:
id brand p_1 p_2 p_3 p_4 p_5
1 A 1 2 5
2 B 2 3
3 C 3
4 B 1
5 A 2
And I would like to first get it into this structure
p A B C
1 1 1 0
2 2 1 0
3 0 1 1
4 0 0 0
5 1 0 0
So it counts every combination of value and brand, BUT it also includes values that never occur, such as 4, which does not appear yet lies between 1 (the min) and 5 (the max). This is the tricky part!
Thanks!

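library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-(1:2)) %>%            # stack p_1..p_5 into name/value pairs
  filter(!is.na(value)) %>%
  count(value, brand) %>%
  complete(value = 1:5, brand, fill = list(n = 0)) %>%  # add the value/brand pairs that never occur
  pivot_wider(names_from = brand, values_from = n, values_fill = 0)
Result: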
# A tibble: 5 × 4
value A B C
<int> <int> <int> <int>
1 1 1 1 0
2 2 2 1 0
3 3 0 1 1
4 4 0 0 0
5 5 1 0 0
source data
df <- data.frame(
stringsAsFactors = FALSE,
id = c(1L, 2L, 3L, 4L, 5L),
brand = c("A", "B", "C", "B", "A"),
p_1 = c(1L, 2L, 3L, 1L, 2L),
p_2 = c(2L, 3L, NA, NA, NA),
p_3 = c(5L, NA, NA, NA, NA),
p_4 = c(NA, NA, NA, NA, NA),
p_5 = c(NA, NA, NA, NA, NA)
)
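
For readers without the tidyverse, the same count table can be sketched in base R. This is an illustration, not part of the answer above: it assumes the p_* columns hold integers, it derives the range from the observed minimum and maximum instead of hard-coding 1:5, and the names p_cols, vals, brands, all_levels and counts are made up for the example. Turning the values into a factor with every level from min to max is what makes table() report the empty row for 4.

```r
df <- data.frame(
  id = 1:5,
  brand = c("A", "B", "C", "B", "A"),
  p_1 = c(1L, 2L, 3L, 1L, 2L),
  p_2 = c(2L, 3L, NA, NA, NA),
  p_3 = c(5L, NA, NA, NA, NA),
  p_4 = NA_integer_,
  p_5 = NA_integer_
)

# All p_* values in one vector (column by column), with the brand repeated to match
p_cols <- grep("^p_", names(df), value = TRUE)
vals   <- unlist(df[p_cols], use.names = FALSE)
brands <- rep(df$brand, times = length(p_cols))

# Every integer between the observed min and max becomes a level, used or not
all_levels <- seq(min(vals, na.rm = TRUE), max(vals, na.rm = TRUE))

# table() drops NA pairs by default but keeps unused factor levels
counts <- table(p = factor(vals, levels = all_levels), brand = brands)
result <- as.data.frame.matrix(counts)
result
```

The values end up as row names here; `cbind(p = all_levels, result)` would turn them into a regular column if that is preferred.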

Related

how to generate a new variable by one column's value overriding the other's in R

I have a dataset that is essentially in the following format:
group var1 var2 var3
a 1 . .
a 1 . .
a 1 2 .
a 1 2 3
a 1 . .
b 1 . .
b 1 2 3
b 1 2 .
b 1 2 3
b 1 2 .
I want to generate a new variable so that the data looks like this:
group var1 var2 var3 new var
a 1 . . 1
a 1 . . 1
a 1 2 . 2
a 1 2 3 3
a 1 . . 3
b 1 . . 1
b 1 2 3 3
b 1 2 . 3
b 1 2 3 3
b 1 2 . 3
Help pls?
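Here is an option with pmax and cummax (assuming the . values are missing, i.e. NA). Grouped by 'group', invoke pmax across the columns whose names start with 'var' (starts_with) to get the row-wise maximum, then take the cumulative maximum (cummax) within each group. Note that purrr's invoke is deprecated as of purrr 1.0.0; do.call(pmax, ...) is a base R equivalent.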
library(dplyr)
library(purrr)
df1 %>%
  group_by(group) %>%
  mutate(newvar = cummax(invoke(pmax,
      c(across(starts_with('var')), na.rm = TRUE)))) %>%
  ungroup
Output:
# A tibble: 10 × 5
group var1 var2 var3 newvar
<chr> <int> <int> <int> <int>
1 a 1 NA NA 1
2 a 1 NA NA 1
3 a 1 2 NA 2
4 a 1 2 3 3
5 a 1 NA NA 3
6 b 1 NA NA 1
7 b 1 2 3 3
8 b 1 2 NA 3
9 b 1 2 3 3
10 b 1 2 NA 3
data
df1 <- structure(list(group = c("a", "a", "a", "a", "a", "b", "b", "b",
"b", "b"), var1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
var2 = c(NA, NA, 2L, 2L, NA, NA, 2L, 2L, 2L, 2L), var3 = c(NA,
NA, NA, 3L, NA, NA, 3L, NA, 3L, NA)), row.names = c(NA, -10L
), class = "data.frame")
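
The same two steps (row-wise maximum, then a running maximum per group) can also be sketched in base R. This is only an illustration of the idea above; the helper names var_cols and row_max are assumptions made for the example. It also assumes at least one var column is non-missing in every row, because a row where all of them are NA would make cummax return NA for the rest of that group.

```r
df1 <- data.frame(
  group = rep(c("a", "b"), each = 5),
  var1  = 1L,
  var2  = c(NA, NA, 2L, 2L, NA, NA, 2L, 2L, 2L, 2L),
  var3  = c(NA, NA, NA, 3L, NA, NA, 3L, NA, 3L, NA)
)

# Step 1: row-wise maximum over the var columns, ignoring NA
var_cols <- grep("^var", names(df1), value = TRUE)
row_max  <- do.call(pmax, c(df1[var_cols], na.rm = TRUE))

# Step 2: running maximum within each group
df1$newvar <- ave(row_max, df1$group, FUN = cummax)
df1
```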
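See if this helps you out. It takes the last non-missing value in each row of the var columns. Note that it works row by row, so it does not carry a value forward within a group (row 5 of the example would get 1, not 3).
lastValue <- function(x) tail(x[!is.na(x)], 1)
# restrict to the var columns so the character 'group' column is not picked up
df$newvar <- apply(df[startsWith(names(df), "var")], 1, lastValue)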

R: making a hard-to-describe transformation to a data frame

Right now I have a dataset that roughly looks like this:
Id Eng_ver_1 Eng_ver_2 Bio_ver_1 Bio_ver_2 Subject Version
1 NA 1 NA NA Eng 2
2 NA NA NA 1 Bio 2
3 NA NA 1 NA Bio 1
4 1 NA NA NA Eng 1
The columns represent conditions that participants go through. Because each person only goes through one condition, it is guaranteed that in every row only 1 of the 4 columns has a value. Instead of looking like this, it is easier to do analysis in my case if the data were to look like this:
Id Subject Version Score
1 English 2 1
2 Biology 2 1
3 Biology 1 1
4 English 1 1
Is there any quick way of doing this transformation? In other words, how do I get rid of all the NAs and collapse the 4 columns into 1?
Additionally, what if instead of 4 columns I have 40 columns, with each Id having data in only 10 of those 40 columns?
Since you'll have data in only one column in each row, I think using rowSums, as suggested by @alistaire, would be an easy and quick solution.
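
The rowSums suggestion is not spelled out in the thread, so here is a minimal sketch of what it could look like. It relies on the guarantee from the question that exactly one of the condition columns is filled per row; the names ver_cols and result are assumptions made for the example. A row with no value at all would get a Score of 0 rather than NA, and with several values per row (the 40-column case) the values would be added together, so this only fits the one-value-per-row situation.

```r
df <- data.frame(
  Id        = 1:4,
  Eng_ver_1 = c(NA, NA, NA, 1L),
  Eng_ver_2 = c(1L, NA, NA, NA),
  Bio_ver_1 = c(NA, NA, 1L, NA),
  Bio_ver_2 = c(NA, 1L, NA, NA),
  Subject   = c("Eng", "Bio", "Bio", "Eng"),
  Version   = c(2L, 2L, 1L, 1L)
)

# Condition columns only; the pattern deliberately excludes 'Version'
ver_cols <- grep("_ver_\\d+$", names(df), value = TRUE)

# With a single non-NA value per row, the row sum is that value
df$Score <- rowSums(df[ver_cols], na.rm = TRUE)
result <- df[c("Id", "Subject", "Version", "Score")]
result
```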
You can also get data in long format with pivot_longer in tidyr :
library(dplyr)
df %>%
tidyr::pivot_longer(cols = matches('.*_ver_\\d+'),
values_drop_na = TRUE, values_to = 'score') %>%
select(-name)
# A tibble: 4 x 4
# Id Subject Version score
# <int> <chr> <int> <int>
#1 1 Eng 2 1
#2 2 Bio 2 1
#3 3 Bio 1 1
#4 4 Eng 1 1
data
df <- structure(list(Id = 1:4, Eng_ver_1 = c(NA, NA, NA, 1L), Eng_ver_2 = c(1L,
NA, NA, NA), Bio_ver_1 = c(NA, NA, 1L, NA), Bio_ver_2 = c(NA,
1L, NA, NA), Subject = c("Eng", "Bio", "Bio", "Eng"), Version = c(2L,
2L, 1L, 1L)), class = "data.frame", row.names = c(NA, -4L))
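We can use coalesce from dplyr. Note that contains() ignores case by default, so contains('ver') would also select the Version column; matching on '_ver_' avoids that.
library(dplyr)
df %>%
  transmute(Id, Subject, Version,
            score = coalesce(!!! select(., contains('_ver_'))))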
# Id Subject Version score
#1 1 Eng 2 1
#2 2 Bio 2 1
#3 3 Bio 1 1
#4 4 Eng 1 1

Combine multiple rows by group, and keep the first value in each column that isn't NA, in R

My data.frame looks like
ID Encounter Value1 Value2
1 A 1 NA
1 A 2 10
1 B NA 20
1 B 4 30
2 A 5 40
2 A 6 50
2 B NA NA
2 B 7 60
and I want it to look like
ID Encounter Value1 Value2
1 A 1 10
1 B 4 20
2 A 5 40
2 B 7 60
We can use dplyr. Grouped by 'ID' and 'Encounter', get the first value that is not NA (!is.na(.)) in each of the remaining columns. If, by any chance, all the values in a group are NA, return NA.
library(dplyr)
df1 %>%
  group_by(ID, Encounter) %>%
  summarise_at(vars(-group_cols()), ~ if(all(is.na(.))) NA_integer_
      else first(.[!is.na(.)]))
# A tibble: 4 x 4
# Groups: ID [2]
# ID Encounter Value1 Value2
# <int> <chr> <int> <int>
#1 1 A 1 10
#2 1 B 4 20
#3 2 A 5 40
#4 2 B 7 60
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
Encounter = c("A",
"A", "B", "B", "A", "A", "B", "B"), Value1 = c(1L, 2L, NA, 4L,
5L, 6L, NA, 7L), Value2 = c(NA, 10L, 20L, 30L, 40L, 50L, NA,
60L)), class = "data.frame", row.names = c(NA, -8L))
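
A base R sketch of the same idea, for comparison. The helper first_non_na is a name made up for this example. It relies on the fact that indexing an empty vector with [1] yields NA, which covers the all-NA case without an explicit if. The final order() call is there because aggregate() lets the first grouping variable vary fastest.

```r
df1 <- data.frame(
  ID        = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Encounter = c("A", "A", "B", "B", "A", "A", "B", "B"),
  Value1    = c(1L, 2L, NA, 4L, 5L, 6L, NA, 7L),
  Value2    = c(NA, 10L, 20L, 30L, 40L, 50L, NA, 60L)
)

# First non-missing element; gives NA when every element is missing
first_non_na <- function(x) x[!is.na(x)][1]

result <- aggregate(
  df1[c("Value1", "Value2")],
  by  = df1[c("ID", "Encounter")],
  FUN = first_non_na
)

# Sort by ID, then Encounter, to match the layout asked for
result <- result[order(result$ID, result$Encounter), ]
result
```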

How to create multiple columns in r dataframe by implementing some query conditions

I have a dataset which is similar to the following:
Age Food_1_1 Food_1_2 Food_1_3 Amount_1_1 Amount_1_2 Amount_1_3
6-9 a b a 2 3 4
6-9 b b c 1 2 3
6-9 NA c a NA 4 1
9-10 c c b 1 3 1
9-10 c a b 1 2 1
Using R, I want to get the following data set which contains new set of columns a, b and c by adding the corresponding values:
Age Food_1_1 Food_1_2 Food_1_3 Amount_1_1 Amount_1_2 Amount_1_3 a b c
6-9 a b a 2 3 4 6 3 0
6-9 b b c 1 2 3 0 3 3
6-9 NA c a NA 4 1 1 0 4
9-10 c c b 1 3 1 0 1 4
9-10 c a b 1 2 1 2 1 1
Note: My data also contains missing values. The variables Monday:Wednesday are factors and the variables Value1:Value3 are numeric. For more clarity: the 1st row of column "a" contains the sum of all values from Value1 to Value3 that belong to a (for example, 2 + 4 = 6).
One way using base R:
data$id <- 1:nrow(data) # Create a unique id
vlist <- list(grep("day$", names(data)), grep("^Value", names(data)))
d1 <- reshape(data, direction="long", varying=vlist, v.names=c("Day","Value"))
d2 <- aggregate(Value~id+Day, FUN=sum, na.rm=TRUE, data=d1)
d3 <- reshape(d2, direction="wide", v.names="Value", timevar="Day")
d3[is.na(d3)] <- 0
merge(data, d3, by="id", all.x=TRUE)
# id Age Monday Tuesday Wednesday Value1 Value2 Value3 Value.a Value.b Value.c
#1 1 6-9 a b a 2 3 4 6 3 0
#2 2 6-9 b b c 1 2 3 0 3 3
#3 3 6-9 <NA> c a NA 4 1 1 0 4
#4 4 9-10 c c b 1 3 1 0 1 4
#5 5 9-10 c a b 1 2 1 2 1 1
Data:
data <- structure(list(Age = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("6-9",
"9-10"), class = "factor"), Monday = structure(c(1L, 2L, NA,
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), Tuesday = structure(c(2L,
2L, 3L, 3L, 1L), .Label = c("a", "b", "c"), class = "factor"),
Wednesday = structure(c(1L, 3L, 1L, 2L, 2L), .Label = c("a",
"b", "c"), class = "factor"), Value1 = c(2L, 1L, NA, 1L,
1L), Value2 = c(3L, 2L, 4L, 3L, 2L), Value3 = c(4L, 3L, 1L,
1L, 1L)), class = "data.frame", row.names = c(NA, -5L))
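
If reshaping twice feels heavy, the per-letter sums can also be sketched directly with matrices. This is an alternative illustration, not the method of the answer above: the names letters_mat and values_mat are assumptions, and the letters a, b and c are hard-coded for brevity. Comparing the letter matrix to one letter gives a TRUE/FALSE mask, multiplying by the value matrix keeps only the matching values, and rowSums adds them up per row. Cells where the letter or the value is NA contribute nothing.

```r
data <- data.frame(
  Age       = c("6-9", "6-9", "6-9", "9-10", "9-10"),
  Monday    = c("a", "b", NA, "c", "c"),
  Tuesday   = c("b", "b", "c", "c", "a"),
  Wednesday = c("a", "c", "a", "b", "b"),
  Value1    = c(2L, 1L, NA, 1L, 1L),
  Value2    = c(3L, 2L, 4L, 3L, 2L),
  Value3    = c(4L, 3L, 1L, 1L, 1L)
)

# Position i of the letter columns pairs with position i of the value columns
letters_mat <- as.matrix(data[c("Monday", "Tuesday", "Wednesday")])
values_mat  <- as.matrix(data[c("Value1", "Value2", "Value3")])

for (lev in c("a", "b", "c")) {
  # TRUE/FALSE mask times value, summed per row; NA products are dropped
  data[[lev]] <- rowSums(values_mat * (letters_mat == lev), na.rm = TRUE)
}
data
```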
You can use the code below:
library(dplyr)
library(tidyr)
library(reshape2)
data[] <- lapply(data, as.character)
data$rownumber <- rownames(data)
# row2 keeps the Letter rows and the Value rows aligned after gathering
x <- gather(data[, c(1:4, 8)], Day, Letter, Monday:Wednesday) %>% mutate(row2 = row_number())
y <- gather(data[, c(1, 5:7, 8)], Day, Value, Value1:Value3) %>% mutate(row2 = row_number())
z <- left_join(x, y, by = c("Age", "rownumber", "row2")) %>%
  filter(!is.na(Letter)) %>%   # drop cells without a letter so no NA column is created
  group_by(Age, rownumber, Letter) %>%
  summarise(suma = sum(as.numeric(Value), na.rm = TRUE), .groups = "drop")
z <- dcast(z, rownumber ~ Letter, value.var = "suma") %>% left_join(data, by = "rownumber")
z[is.na(z)] <- 0
Output:
rownumber a b c Age Monday Tuesday Wednesday Value1 Value2 Value3
1 1 6 3 0 6-9 a b a 2 3 4
2 2 0 3 3 6-9 b b c 1 2 3
3 3 1 0 4 6-9 c a 0 4 1
4 4 0 1 4 9-10 c c b 1 3 1
5 5 2 1 1 9-10 c a b 1 2 1

Merge 2 data frames by row and column overlap

I would like to merge 2 data frames additively such that
taxonomy A B C
1 rat 0 1 2
2 dog 1 2 3
3 cat 2 3 0
and
taxonomy A D C
1 rat 0 1 9
2 Horse 0 2 6
3 cat 2 0 2
produce
taxonomy A B C D
1 rat 0 1 11 1
2 Horse 0 0 6 2
3 cat 4 3 2 0
4 dog 1 2 3 0
I've tried aggregate, merge, apply, ddply.... with no success...this will be done on 2 data frames with a couple hundred rows and columns
With bind_rows from dplyr:
library(dplyr)
bind_rows(df1, df2) %>%
group_by(taxonomy) %>%
summarize_all(sum, na.rm = TRUE)
Output:
# A tibble: 4 x 5
taxonomy A B C D
<chr> <int> <int> <int> <int>
1 cat 4 3 2 0
2 dog 1 2 3 0
3 Horse 0 0 6 2
4 rat 0 1 11 1
Data:
df1 <- structure(list(taxonomy = c("rat", "dog", "cat"), A = 0:2, B = 1:3,
C = c(2L, 3L, 0L)), .Names = c("taxonomy", "A", "B", "C"), class = "data.frame", row.names = c("1",
"2", "3"))
df2 <- structure(list(taxonomy = c("rat", "Horse", "cat"), A = c(0L,
0L, 2L), D = c(1L, 2L, 0L), C = c(9L, 6L, 2L)), .Names = c("taxonomy",
"A", "D", "C"), class = "data.frame", row.names = c("1", "2",
"3"))
The data.table equivalent of @avid_useR's answer.
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), by = taxonomy]
# taxonomy A B C D
#1: rat 0 1 11 1
#2: dog 1 2 3 0
#3: cat 4 3 2 0
#4: Horse 0 0 6 2
You can do...
> library(reshape2)
> dcast(rbind(melt(DF1), melt(DF2)), taxonomy ~ variable, fun.aggregate = sum)
Using taxonomy as id variables
Using taxonomy as id variables
taxonomy A B C D
1 cat 4 3 2 0
2 dog 1 2 3 0
3 Horse 0 0 6 2
4 rat 0 1 11 1
This sorts the rows and columns alphabetically, but I guess this might be avoidable by using a factor.
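
The factor idea is only guessed at above, so here is a hedged base R sketch of how it might work, using aggregate() instead of dcast(). The helper pad_missing and the names all_cols and stacked are made up for the example. Missing columns are filled with 0, and taxonomy is turned into a factor whose levels follow the order of first appearance, so the grouped result keeps that order instead of sorting alphabetically.

```r
DF1 <- data.frame(taxonomy = c("rat", "dog", "cat"),
                  A = 0:2, B = 1:3, C = c(2L, 3L, 0L))
DF2 <- data.frame(taxonomy = c("rat", "Horse", "cat"),
                  A = c(0L, 0L, 2L), D = c(1L, 2L, 0L), C = c(9L, 6L, 2L))

all_cols <- union(names(DF1), names(DF2))

# Add any column a data frame lacks, filled with 0, and align the column order
pad_missing <- function(d) {
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- 0L
  d[all_cols]
}

stacked <- rbind(pad_missing(DF1), pad_missing(DF2))

# Levels in order of first appearance control the row order of the result
stacked$taxonomy <- factor(stacked$taxonomy, levels = unique(stacked$taxonomy))

result <- aggregate(. ~ taxonomy, data = stacked, FUN = sum)
result
```

The column order follows all_cols, so reordering that vector is one way to control the column layout as well.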
Data:
DF1 = structure(list(taxonomy = c("rat", "dog", "cat"), A = 0:2, B = 1:3,
C = c(2L, 3L, 0L)), .Names = c("taxonomy", "A", "B", "C"), row.names = c(NA,
-3L), class = "data.frame")
DF2 = structure(list(taxonomy = c("rat", "Horse", "cat"), A = c(0L,
0L, 2L), D = c(1L, 2L, 0L), C = c(9L, 6L, 2L)), .Names = c("taxonomy",
"A", "D", "C"), row.names = c(NA, -3L), class = "data.frame")
