This question already has answers here:
Gather multiple sets of columns
(5 answers)
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 6 years ago.
I have a shopping cart data, which look like the sample dataframe below:
sample_df<-data.frame(
clientid=1:10,
ProductA=c("chair","table","plate","plate","table","chair","table","plate","chair","chair"),
QuantityA=c(1,2,1,1,1,1,2,3,1,2),
ProductB=c("table","doll","shoes","","door","","computer","computer","","plate"),
QuantityB=c(3,1,2,"",2,"",1,1,"",1)
)
#sample data frame
clientid ProductA QuantityA ProductB QuantityB
1 1 chair 1 table 3
2 2 table 2 doll 1
3 3 plate 1 shoes 2
4 4 plate 1
...
10 10 chair 2 plate 1
I would like to transform it into different format, which will be like:
#ideal data frame
clientid ProductNumber Product Quantity
1 1 A chair 1
2 1 B table 3
3 2 A table 2
4 2 B doll 1
...
11 6 A chair 1
...
17 10 A chair 2
18 10 B plate 1
I have tried
library(tidyr)
sample_df_gather<- sample_df %>% select(clientid, ProductA, ProductB)
%>% gather(ProductNumber, value, -clientid) %>% filter(!is.na(value))
#this gives me
clientid ProductNumber value
1 1 ProductA chair
2 2 ProductB table
3 3 ProductA plate
4 4 ProductB plate
...
However, I don't know how to add Quantity to the data frame. Also, in the actual data frame, there are more columns such as Title, price, which I would like to transform into the ideal data frame as well. Is there a way to transform the data into the ideal format?
With data.table:
library(data.table)
res = melt(setDT(sample_df),
measure.vars = patterns("^Product", "^Quantity"),
variable.name = "ProductNumber")
res[, ProductNumber := factor(ProductNumber, labels = c("A","B"))]
which gives
clientid ProductNumber value1 value2
1: 1 A chair 1
2: 2 A table 2
3: 3 A plate 1
4: 4 A plate 1
5: 5 A table 1
6: 6 A chair 1
7: 7 A table 2
8: 8 A plate 3
9: 9 A chair 1
10: 10 A chair 2
11: 1 B table 3
12: 2 B doll 1
13: 3 B shoes 2
14: 4 B NA NA
15: 5 B door 2
16: 6 B NA NA
17: 7 B computer 1
18: 8 B computer 1
19: 9 B NA NA
20: 10 B plate 1
Data (since the OP's original data was borked):
structure(list(clientid = 1:10, ProductA = structure(c(1L, 3L,
2L, 2L, 3L, 1L, 3L, 2L, 1L, 1L), .Label = c("chair", "plate",
"table"), class = "factor"), QuantityA = c(1L, 2L, 1L, 1L, 1L,
1L, 2L, 3L, 1L, 2L), ProductB = structure(c(6L, 2L, 5L, NA, 3L,
NA, 1L, 1L, NA, 4L), .Label = c("computer", "doll", "door", "plate",
"shoes", "table"), class = "factor"), QuantityB = c(3L, 1L, 2L,
NA, 2L, NA, 1L, 1L, NA, 1L)), .Names = c("clientid", "ProductA",
"QuantityA", "ProductB", "QuantityB"), row.names = c(NA, -10L
), class = "data.frame")
Related
I am relatively new to R but slowly finding my way. I encountered a problem, however, and hope someone can help me.
Let's say I two dataframes (lets call them A and B), both containing survey responses. A contains all responses from the first set of people. B contains the responses of the second set of people, plus the people of the first set but with their responses set to NA. An example:
Dataframe A:
Household Individual Answer_A Answer_b
1 2 5 6
1 3 6 6
2 1 2 3
Dataframe B:
Household Individual Answer_A Answer_b
1 1 3 6
1 2 NA NA
1 3 NA NA
2 1 NA NA
2 2 4 7
I want to get one dataframe with all individuals and their responses:
Dataframe C:
Household Individual Answer_A Answer_b
1 1 3 6
1 2 5 6
1 3 6 6
2 1 2 3
2 2 4 7
If I only have two datasets I can use rbind.fill, with rbind.fill(B, A) to get dataframe C, as then the NAs in B are overwritten with answers in A.
But... if I would have to add a third dataset, D, that would consist of NAs for people in A and B, I would not be able to use this solution. What would I be able to do then? I've looked at intersect, outersect, different forms of join, but can't seem to think of a good solution.
Any thoughts?
Maybe you can left_join and then use coalesce
library(dplyr)
left_join(B, A, by = c("Household", "Individual")) %>%
mutate(Answer_A = coalesce(Answer_A.x, Answer_A.y),
Answer_B = coalesce(Answer_b.x, Answer_b.y)) %>%
select(-matches("\\.x|\\.y"))
# Household Individual Answer_A Answer_B
#1 1 1 3 6
#2 1 2 5 6
#3 1 3 6 6
#4 2 1 2 3
#5 2 2 4 7
data
A <- structure(list(Household = c(1L, 1L, 2L), Individual = c(2L,
3L, 1L), Answer_A = c(5L, 6L, 2L), Answer_b = c(6L, 6L, 3L)), class = "data.frame",
row.names = c(NA, -3L))
B <- structure(list(Household = c(1L, 1L, 1L, 2L, 2L), Individual = c(1L,
2L, 3L, 1L, 2L), Answer_A = c(3L, NA, NA, NA, 4L), Answer_b = c(6L,
NA, NA, NA, 7L)), class = "data.frame", row.names = c(NA, -5L))
This question already has answers here:
Convert data from long format to wide format with multiple measure columns
(6 answers)
Closed 4 years ago.
I want to reshape a long dataframe to a wide one. That is, I want to go from this:
file label val1 val2
1 red A 12 3
2 red B 4 2
3 red C 5 8
4 green A 3 3
5 green B 6 5
6 green C 9 6
7 blue A 3 3
8 blue B 1 2
9 blue C 4 6
to this:
file value1_A value1_B value1_C value2_A value2_B value2_C
1 red 12 4 5 3 2 8
2 green 3 6 9 3 5 6
3 blue 3 1 4 3 2 6
My best attempt thus far is as follows:
library(tidyverse)
dat <-
structure(list(file = structure(c(3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L),
.Label = c("blue", "green", "red"),
class = "factor"),
label = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
.Label = c("A", "B", "C"),
class = "factor"),
val1 = c(12L, 4L, 5L, 3L, 6L, 9L, 3L, 1L, 4L),
val2 = c(3L, 2L, 8L, 3L, 5L, 6L, 3L, 2L, 6L)),
class = "data.frame", row.names = c(NA, -9L))
dat %>%
group_by(file) %>%
mutate(values1 = paste('value1', label, sep='_'),
values2 = paste('value2', label, sep='_')) %>%
spread(values1, val1) %>%
spread(values2, val2) %>%
select(-label)
# # A tibble: 9 x 7
# # Groups: file [3]
# file value1_A value1_B value1_C value2_A value2_B value2_C
# <fct> <int> <int> <int> <int> <int> <int>
# 1 blue 3 NA NA 3 NA NA
# 2 blue NA 1 NA NA 2 NA
# 3 blue NA NA 4 NA NA 6
# 4 green 3 NA NA 3 NA NA
# 5 green NA 6 NA NA 5 NA
# 6 green NA NA 9 NA NA 6
# 7 red 12 NA NA 3 NA NA
# 8 red NA 4 NA NA 2 NA
# 9 red NA NA 5 NA NA 8
The output is unsatisfactory since what should be on one row occupies three, with multiple 'NA'. This seems to be due to using spread twice, but I don't know how else to achieve the result I desire. I'd very much appreciate any advice on how to do this.
Many thanks in advance,
-R
Here's a way
library(tidyverse)
dat %>%
# first move to long form so we can
# see the original column names as strings
gather("variable_name", "value", contains("val")) %>%
# create the new column names from the variable name and the label
mutate(new_column_name = paste(variable_name, label, sep="_")) %>%
# get rid of the pieces we used to make the column names
select(-label, -variable_name) %>%
# now spread
spread(new_column_name, value)
here's the data.table way. all in one line of code...
library( data.table )
dcast( setDT( dat ), file ~ label, value.var = c("val1", "val2"))
# file val1_A val1_B val1_C val2_A val2_B val2_C
# 1: blue 3 1 4 3 2 6
# 2: green 3 6 9 3 5 6
# 3: red 12 4 5 3 2 8
Within a group, I want to find the difference between that row and the first time that user appeared in the data. For example, I need to create the diff variable below. Users have different number of rows each as in the following data:
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L),
money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L), diff = c(NA, 3L, 4L,
6L, NA, 2L, 3L, NA, NA, 8L)), .Names = c("ID", "money", "occurence",
"diff"), class = "data.frame", row.names = c(NA, -10L))
ID money occurence diff
1 1 9 1 NA
2 1 12 2 3
3 1 13 3 4
4 1 15 4 6
5 2 5 1 NA
6 2 7 2 2
7 2 8 3 3
8 3 5 1 NA
9 4 2 1 NA
10 4 10 2 8
You can use ave(). We just remove the first value per group and replace it with NA, and subtract the first value from the rest of the values.
with(df, ave(money, ID, FUN = function(x) c(NA, x[-1] - x[1])))
# [1] NA 3 4 6 NA 2 3 NA NA 8
A dplyr solution, which uses the first function to get the first value and calculate the difference.
library(dplyr)
df2 <- df %>%
group_by(ID) %>%
mutate(diff = money - first(money)) %>%
mutate(diff = replace(diff, diff == 0, NA)) %>%
ungroup()
df2
# # A tibble: 10 x 4
# ID money occurence diff
# <int> <int> <int> <int>
# 1 1 9 1 NA
# 2 1 12 2 3
# 3 1 13 3 4
# 4 1 15 4 6
# 5 2 5 1 NA
# 6 2 7 2 2
# 7 2 8 3 3
# 8 3 5 1 NA
# 9 4 2 1 NA
# 10 4 10 2 8
Update
Here is a data.table solution provided by Sotos. Notice that no need to replace 0 with NA.
library(data.table)
setDT(df)[, money := money - first(money), by = ID][]
# ID money occurence diff
# 1: 1 0 1 NA
# 2: 1 3 2 3
# 3: 1 4 3 4
# 4: 1 6 4 6
# 5: 2 0 1 NA
# 6: 2 2 2 2
# 7: 2 3 3 3
# 8: 3 0 1 NA
# 9: 4 0 1 NA
# 10: 4 8 2 8
DATA
dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L),
money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "money",
"occurence"), row.names = c(NA, -10L), class = "data.frame")
I have the following data frame
id<-c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)
time<-c(0,1,2,3,4,5,6,7,0,1,2,3,0,1,2,3)
value<-c(1,1,6,1,2,0,0,1,2,6,2,2,1,1,6,1)
d<-data.frame(id, time, value)
The value 6 appears only once for each id. For every id, i would like to remove all rows after the line with the value 6 per id except the first two lines coming after.
I've searched and found a similar problem, but i couldnt adapt it myself. I therefore use the code of this thread
In the above case the final data frame should be
id time value
1 0 1
1 1 1
1 2 6
1 3 1
1 4 2
2 0 2
2 1 6
2 2 2
2 3 2
3 0 1
3 1 1
3 2 6
3 3 1
On of the solution given seems getting very close to what i need. But i didn't manage to adapt it. Could u help me?
library(plyr)
ddply(d, "id",
function(x) {
if (any(x$value == 6)) {
subset(x, time <= x[x$value == 6, "time"])
} else {
x
}
}
)
Thank you very much.
We could use data.table. Convert the 'data.frame' to 'data.table' (setDT(d)). Grouped by the 'id' column, we get the position of 'value' that is equal to 6. Add 2 to it. Find the min of the number of elements for that group (.N) and the position, get the seq, and use that to subset the dataset. We can also add an if/else condition to check whether there are any 6 in the 'value' column or else to return .SD without any subsetting.
library(data.table)
setDT(d)[, if(any(value==6)) .SD[seq(min(c(which(value==6) + 2, .N)))]
else .SD, by = id]
# id time value
# 1: 1 0 1
# 2: 1 1 1
# 3: 1 2 6
# 4: 1 3 1
# 5: 1 4 2
# 6: 2 0 2
# 7: 2 1 6
# 8: 2 2 2
# 9: 2 3 2
#10: 3 0 1
#11: 3 1 1
#12: 3 2 6
#13: 3 3 1
#14: 4 0 1
#15: 4 1 2
#16: 4 2 5
Or as #Arun mentioned in the comments, we can use the ?head to subset, which would be faster
setDT(d)[, if(any(value==6)) head(.SD, which(value==6L)+2L) else .SD, by = id]
Or using dplyr, we group by 'id', get the position of 'value' 6 with which, add 2, get the seq and use that numeric index within slice to extract the rows.
library(dplyr)
d %>%
group_by(id) %>%
slice(seq(which(value==6)+2))
# id time value
#1 1 0 1
#2 1 1 1
#3 1 2 6
#4 1 3 1
#5 1 4 2
#6 2 0 2
#7 2 1 6
#8 2 2 2
#9 2 3 2
#10 3 0 1
#11 3 1 1
#12 3 2 6
#13 3 3 1
#14 4 0 1
#15 4 1 2
#16 4 2 5
data
d <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), time = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 3L, 0L, 1L, 2L, 3L, 0L, 1L, 2L), value = c(1L, 1L, 6L, 1L,
2L, 2L, 6L, 2L, 2L, 1L, 1L, 6L, 1L, 1L, 2L, 5L)), .Names = c("id",
"time", "value"), class = "data.frame", row.names = c(NA, -16L))
I have the following data table
PIECE SAMPLE QC_CODE
1 1 1
2 1 NA
3 2 2
4 2 4
5 2 NA
6 3 6
7 3 3
8 3 NA
9 4 6
10 4 NA
and I would like to count the number of qc_code in each sample and return an output like this
SAMPLE SAMPLE_SIZE QC_CODE_COUNT
1 2 1
2 3 2
3 3 2
4 2 1
Where sample size is the count of pieces in each sample, and qc_code_count is the count of al qc_code that are no NA.
How would I go about this in R
You can try
library(dplyr)
df1 %>%
group_by(SAMPLE) %>%
summarise(SAMPLE_SIZE=n(), QC_CODE_UNIT= sum(!is.na(QC_CODE)))
# SAMPLE SAMPLE_SIZE QC_CODE_UNIT
#1 1 2 1
#2 2 3 2
#3 3 3 2
#4 4 2 1
Or
library(data.table)
setDT(df1)[,list(SAMPLE_SIZE=.N, QC_CODE_UNIT=sum(!is.na(QC_CODE))), by=SAMPLE]
Or using aggregate from base R
do.call(data.frame,aggregate(QC_CODE~SAMPLE, df1, na.action=NULL,
FUN=function(x) c(SAMPLE_SIZE=length(x), QC_CODE_UNIT= sum(!is.na(x)))))
data
df1 <- structure(list(PIECE = 1:10, SAMPLE = c(1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 4L), QC_CODE = c(1L, NA, 2L, 4L, NA, 6L, 3L, NA,
6L, NA)), .Names = c("PIECE", "SAMPLE", "QC_CODE"), class = "data.frame",
row.names = c(NA, -10L))