Converting the Format of a Data Frame in R

May I know how to convert the format of this data frame? It holds one participant who took three tests (A, B, C) at two time points (0, 2) on two words (Word_ID: 201, 202), with each score coded as 0 or 1.
I would like to convert my data frame to look like this, with "Time" running "0, 0, 0, 2, 2, 2" within each word.
Participant Time Measure Word_ID Score
100 0 A 201 0
100 0 B 201 1
100 0 C 201 0
100 2 A 201 1
100 2 B 201 1
100 2 C 201 1
100 0 A 202 0
100 0 B 202 0
100 0 C 202 0
100 2 A 202 1
100 2 B 202 1
100 2 C 202 1
But my current data frame looks like this. May I have your suggestions? Thank you very much.
Participant Time Measure 201 202
100 0 A 0 0
100 0 B 1 0
100 0 C 0 0
100 2 A 1 1
100 2 B 1 1
100 2 C 1 1

Reading your data in as df:
df <- read.table(text = " Participant Time Measure 201 202
100 0 A 0 0
100 0 B 1 0
100 0 C 0 0
100 2 A 1 1
100 2 B 1 1
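100 2 C 1 1", header = TRUE)
In this case, the column names 201 and 202 become X201 and X202, because read.table() makes column names syntactically valid by default (check.names = TRUE).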
library(dplyr)
library(stringr)
library(reshape2)
df %>%
reshape2::melt(id = c('Participant', 'Time', 'Measure'),
variable.name = "Word_ID",
value.name = "Score") %>%
mutate(Word_ID = str_remove(Word_ID, "X"))
Participant Time Measure Word_ID Score
1 100 0 A 201 0
2 100 0 B 201 1
3 100 0 C 201 0
4 100 2 A 201 1
5 100 2 B 201 1
6 100 2 C 201 1
7 100 0 A 202 0
8 100 0 B 202 0
9 100 0 C 202 0
10 100 2 A 202 1
11 100 2 B 202 1
12 100 2 C 202 1
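If you would rather not strip the "X" prefix afterwards, a possible alternative is to read the data with check.names = FALSE so the columns stay named 201 and 202. The following is only a sketch in base R; the helper names word_cols and long are made up for illustration.

```r
# Read the sample data, keeping the non-syntactic names "201" and "202"
df <- read.table(text = "Participant Time Measure 201 202
100 0 A 0 0
100 0 B 1 0
100 0 C 0 0
100 2 A 1 1
100 2 B 1 1
100 2 C 1 1", header = TRUE, check.names = FALSE)

# Columns to stack (hypothetical helper name)
word_cols <- c("201", "202")

# Repeat the id columns once per word column and stack the scores
long <- data.frame(
  Participant = rep(df$Participant, times = length(word_cols)),
  Time        = rep(df$Time, times = length(word_cols)),
  Measure     = rep(df$Measure, times = length(word_cols)),
  Word_ID     = rep(word_cols, each = nrow(df)),
  Score       = unlist(df[word_cols], use.names = FALSE)
)
long
```

Because the columns are stacked one after another, the rows come out grouped by Word_ID, with Time running 0, 0, 0, 2, 2, 2 inside each word.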

You can use pivot_longer from tidyr:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(`201`:`202`, names_to = "Word_ID", values_to = "Score") %>%
arrange(Participant, Word_ID)
Output
Participant Time Measure Word_ID Score
<int> <int> <chr> <chr> <int>
1 100 0 A 201 0
2 100 0 B 201 1
3 100 0 C 201 0
4 100 2 A 201 1
5 100 2 B 201 1
6 100 2 C 201 1
7 100 0 A 202 0
8 100 0 B 202 0
9 100 0 C 202 0
10 100 2 A 202 1
11 100 2 B 202 1
12 100 2 C 202 1
Data
df <- structure(list(Participant = c(100L, 100L, 100L, 100L, 100L,
100L), Time = c(0L, 0L, 0L, 2L, 2L, 2L), Measure = c("A", "B",
"C", "A", "B", "C"), `201` = c(0L, 1L, 0L, 1L, 1L, 1L), `202` = c(0L,
0L, 0L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
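As a variation on the above, and only as a sketch: if you want Word_ID to be an integer rather than a character column, pivot_longer() accepts a names_transform argument. Selecting the columns by exclusion is my own choice here, and it assumes every column other than the three id columns is a word column.

```r
library(tidyr)

df <- data.frame(
  Participant = rep(100L, 6),
  Time        = c(0L, 0L, 0L, 2L, 2L, 2L),
  Measure     = c("A", "B", "C", "A", "B", "C"),
  `201`       = c(0L, 1L, 0L, 1L, 1L, 1L),
  `202`       = c(0L, 0L, 0L, 1L, 1L, 1L),
  check.names = FALSE
)

# Pivot every column except the id columns; convert the new names to integer
long <- pivot_longer(
  df,
  cols = -c(Participant, Time, Measure),
  names_to = "Word_ID",
  values_to = "Score",
  names_transform = list(Word_ID = as.integer)
)

# Order by word, then time, then measure
long <- long[order(long$Word_ID, long$Time, long$Measure), ]
long
```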

Related

Conditional replacement of values in a column

I have the following:
ID Value1 Value2 Code
0001 3.3 432 A
0001 0 654 A
0001 0 63 A
0002 0 78 B
0002 1 98 B
0003 0 22 C
0003 0 65 C
0003 0 91 C
I need the following:
ID Value1 Value2 Code
0001 3.3 432 A
0001 0 0 A
0001 0 0 A
0002 0 0 B
0002 1 98 B
0003 0 22 C
0003 0 65 C
0003 0 91 C
That is, within the same "Code", if at least one row has Value1 != 0, then Value2 is set to 0 in all the other rows of that Code (so 654 and 63 for 0001 become 0). If no row has Value1 != 0 (as for 0003), nothing is changed.
Can anyone help me please?
Thank you in advance
dplyr
library(dplyr)
quux %>%
group_by(Code) %>%
mutate(Value2 = if_else(abs(Value1) > 0 | !any(abs(Value1) > 0),
Value2, 0L)) %>%
ungroup()
# # A tibble: 8 x 4
# ID Value1 Value2 Code
# <int> <dbl> <int> <chr>
# 1 1 3.3 432 A
# 2 1 0 0 A
# 3 1 0 0 A
# 4 2 0 0 B
# 5 2 1 98 B
# 6 3 0 22 C
# 7 3 0 65 C
# 8 3 0 91 C
base R
quux |>
transform(Value2 = ifelse(ave(abs(Value1), Code, FUN = function(v) abs(v) > 0 | !any(abs(v) > 0)),
Value2, 0L))
# ID Value1 Value2 Code
# 1 1 3.3 432 A
# 2 1 0.0 0 A
# 3 1 0.0 0 A
# 4 2 0.0 0 B
# 5 2 1.0 98 B
# 6 3 0.0 22 C
# 7 3 0.0 65 C
# 8 3 0.0 91 C
data.table
library(data.table)
as.data.table(quux)[, Value2 := fifelse(abs(Value1) > 0 | !any(abs(Value1) > 0), Value2, 0L), by = Code][]
# ID Value1 Value2 Code
# <int> <num> <int> <char>
# 1: 1 3.3 432 A
# 2: 1 0.0 0 A
# 3: 1 0.0 0 A
# 4: 2 0.0 0 B
# 5: 2 1.0 98 B
# 6: 3 0.0 22 C
# 7: 3 0.0 65 C
# 8: 3 0.0 91 C
Data
quux <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), Value1 = c(3.3, 0, 0, 0, 1, 0, 0, 0), Value2 = c(432L, 654L, 63L, 78L, 98L, 22L, 65L, 91L), Code = c("A", "A", "A", "B", "B", "C", "C", "C")), class = "data.frame", row.names = c(NA, -8L))
This should do it, keeping Value2 where Value1 is non-zero or where the group has no non-zero Value1 at all:
df %>% group_by(Code) %>%
mutate(Value2 = if_else(Value1 != 0 | !any(Value1 != 0), Value2, 0))
# A tibble: 8 × 4
# Groups: Code [3]
# ID Value1 Value2 Code
# <int> <dbl> <dbl> <fct>
# 1 1 3.3 432 A
# 2 1 0 0 A
# 3 1 0 0 A
# 4 2 0 0 B
# 5 2 1 98 B
# 6 3 0 22 C
# 7 3 0 65 C
# 8 3 0 91 C
We can use if_else() here. For example:
library(dplyr)
dd %>%
group_by(ID) %>%
mutate(Value2=if_else(any(Value1!=0) & Value1==0, 0L, Value2))
Basically we use any() to check for non-zero values and then replace with 0s if one is found.
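To make the any() logic easier to follow, here is a rough base R sketch of the same idea, grouped by Code as in the question. The intermediate names group_has_nonzero and result are made up for illustration.

```r
quux <- data.frame(
  ID     = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
  Value1 = c(3.3, 0, 0, 0, 1, 0, 0, 0),
  Value2 = c(432L, 654L, 63L, 78L, 98L, 22L, 65L, 91L),
  Code   = c("A", "A", "A", "B", "B", "C", "C", "C")
)

# TRUE for every row whose Code group contains at least one non-zero Value1
group_has_nonzero <- ave(quux$Value1 != 0, quux$Code, FUN = any)

# Zero out Value2 only where the group qualifies and this row's Value1 is zero
result <- quux
result$Value2[group_has_nonzero & result$Value1 == 0] <- 0L
result
```

Group C has no non-zero Value1, so its rows are left untouched.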

Count with conditions in R dataframe

I have the following DF:
Week SKU Discount(%)
1 111 5
2 111 5
3 111 0
4 111 10
1 222 0
2 222 10
3 222 15
4 222 20
1 333 5
2 333 0
3 333 0
I would like to have this outcome:
Week SKU Discount(%) Duration LastDiscount
1 111 5 2 0
2 111 5 2 0
3 111 0 0 0
4 111 10 1 2
1 222 0 0 0
2 222 10 3 0
3 222 15 3 0
4 222 20 3 0
1 333 5 1 0
2 333 0 0 0
3 333 0 0 0
Duration is the number of weeks that 1 SKU had discounts continuously.
LastDiscount counts the number of weeks from the last time the SKU was on a continuous discount, only if there are weeks with 0 in between discounts.
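One option for "Duration" is, after grouping by 'SKU', to apply rle (run-length encoding) to the logical vector Discount > 0, take its 'lengths' and 'values', and replicate each run's duration over the rows of that run. Similarly, "LastDiscount" is obtained by summing the logical run values, i.e. counting the discount runs, and placing that count on the last row of the group when there is more than one run.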
library(dplyr)
df1 %>%
group_by(SKU) %>%
mutate(Duration = with(rle(Discount > 0), rep(lengths*values,
lengths)),
temp = with(rle(Discount > 0), sum(values != 0)),
LastDiscount = if(temp[1] > 1) c(rep(0, n()-1), temp[1]) else 0) %>%
select(-temp)
# A tibble: 11 x 5
# Groups: SKU [3]
# Week SKU Discount Duration LastDiscount
# <int> <int> <int> <int> <dbl>
# 1 1 111 5 2 0
# 2 2 111 5 2 0
# 3 3 111 0 0 0
# 4 4 111 10 1 2
# 5 1 222 0 0 0
# 6 2 222 10 3 0
# 7 3 222 15 3 0
# 8 4 222 20 3 0
# 9 1 333 5 1 0
#10 2 333 0 0 0
#11 3 333 0 0 0
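To see what rle() contributes to the Duration column, it may help to look at a single SKU in isolation. This is just an illustrative sketch using the Discount values of SKU 111 from the question.

```r
# Discount values for SKU 111, weeks 1 to 4
discount <- c(5L, 5L, 0L, 10L)

# Run-length encode "is there a discount this week?"
runs <- rle(discount > 0)
runs$lengths  # length of each run of identical values
runs$values   # whether each run is a discount run

# Multiplying by the logical values zeroes out the non-discount runs;
# rep() then spreads each run's length over the rows of that run
duration <- rep(runs$lengths * runs$values, runs$lengths)
duration
```

Here the runs are two discount weeks, one week without a discount, and one discount week, so duration comes out as 2, 2, 0, 1.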
Or using data.table
library(data.table)
i1 <- setDT(df1)[, grp := rleid(Discount > 0), SKU][Discount > 0,
Duration := .N, .(grp, SKU)][,
LastDiscount := uniqueN(grp[Discount > 0]), .(SKU)][,
tail(.I[Discount > 0 & LastDiscount > 1], 1), SKU]$V1
df1[-i1, LastDiscount := 0][]
data
df1 <- structure(list(Week = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L,
3L), SKU = c(111L, 111L, 111L, 111L, 222L, 222L, 222L, 222L,
333L, 333L, 333L), Discount = c(5L, 5L, 0L, 10L, 0L, 10L, 15L,
20L, 5L, 0L, 0L)), class = "data.frame", row.names = c(NA, -11L
))

R - Insert Missing Numbers in A Sequence by Group's Max Value

I'd like to insert missing numbers in the index column following these conditions:
Partitioned by multiple columns
The minimum value is always 1
The maximum value is always the maximum for the group and type
Current Data:
group type index vol
A 1 1 200
A 1 2 244
A 1 5 33
A 2 2 66
A 2 3 2
A 2 4 199
A 2 10 319
B 1 4 290
B 1 5 188
B 1 6 573
B 1 9 122
Desired Data:
group type index vol
A 1 1 200
A 1 2 244
A 1 3 0
A 1 4 0
A 1 5 33
A 2 1 0
A 2 2 66
A 2 3 2
A 2 4 199
A 2 5 0
A 2 6 0
A 2 7 0
A 2 8 0
A 2 9 0
A 2 10 319
B 1 1 0
B 1 2 0
B 1 3 0
B 1 4 290
B 1 5 188
B 1 6 573
B 1 7 0
B 1 8 0
B 1 9 122
I've just added in spaces between the partitions for clarity.
Hope you can help out!
You can do the following
library(dplyr)
library(tidyr)
my_df %>%
group_by(group, type) %>%
complete(index = 1:max(index), fill = list(vol = 0))
# group type index vol
# 1 A 1 1 200
# 2 A 1 2 244
# 3 A 1 3 0
# 4 A 1 4 0
# 5 A 1 5 33
# 6 A 2 1 0
# 7 A 2 2 66
# 8 A 2 3 2
# 9 A 2 4 199
# 10 A 2 5 0
# 11 A 2 6 0
# 12 A 2 7 0
# 13 A 2 8 0
# 14 A 2 9 0
# 15 A 2 10 319
# 16 B 1 1 0
# 17 B 1 2 0
# 18 B 1 3 0
# 19 B 1 4 290
# 20 B 1 5 188
# 21 B 1 6 573
# 22 B 1 7 0
# 23 B 1 8 0
# 24 B 1 9 122
With group_by you specify the groups you indicated with the white spaces. With complete you specify which columns should be completed and which values should be filled in for the remaining columns (the default would be NA).
Data
my_df <-
structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
type = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L),
index = c(1L, 2L, 5L, 2L, 3L, 4L, 10L, 4L, 5L, 6L, 9L),
vol = c(200L, 244L, 33L, 66L, 2L, 199L, 319L, 290L, 188L, 573L, 122L)),
class = "data.frame", row.names = c(NA, -11L))
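For anyone who wants to avoid the tidyverse, the same group-then-complete idea can be sketched in base R. This is only one possible approach; the helper names maxes, grid and filled are made up, and group is stored as a character column here rather than a factor.

```r
my_df <- data.frame(
  group = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  type  = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L),
  index = c(1L, 2L, 5L, 2L, 3L, 4L, 10L, 4L, 5L, 6L, 9L),
  vol   = c(200L, 244L, 33L, 66L, 2L, 199L, 319L, 290L, 188L, 573L, 122L)
)

# One row per (group, type) partition holding that partition's maximum index
maxes <- aggregate(index ~ group + type, data = my_df, FUN = max)

# Expand each partition to the full sequence 1..max(index)
grid <- do.call(rbind, lapply(seq_len(nrow(maxes)), function(i) {
  data.frame(group = maxes$group[i],
             type  = maxes$type[i],
             index = seq_len(maxes$index[i]))
}))

# Left-join the original rows onto the full grid and fill the gaps with 0
filled <- merge(grid, my_df, by = c("group", "type", "index"), all.x = TRUE)
filled$vol[is.na(filled$vol)] <- 0L
filled <- filled[order(filled$group, filled$type, filled$index), ]
filled
```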
One dplyr and tidyr possibility could be:
df %>%
group_by(group, type) %>%
complete(index = full_seq(1:max(index), 1), fill = list(vol = 0))
group type index vol
<fct> <int> <dbl> <dbl>
1 A 1 1 200
2 A 1 2 244
3 A 1 3 0
4 A 1 4 0
5 A 1 5 33
6 A 2 1 0
7 A 2 2 66
8 A 2 3 2
9 A 2 4 199
10 A 2 5 0
11 A 2 6 0
12 A 2 7 0
13 A 2 8 0
14 A 2 9 0
15 A 2 10 319
16 B 1 1 0
17 B 1 2 0
18 B 1 3 0
19 B 1 4 290
20 B 1 5 188
21 B 1 6 573
22 B 1 7 0
23 B 1 8 0
24 B 1 9 122

Delete repeated TIME points in a dataframe

I have a simple goal that I want to achieve in my data frame that looks like this:
ID TIME AMT
1 0 100
1 1 0
1 2 0
1 2 50
1 3 0
2 0 50
2 1 0
2 2 0
2 2 100
2 3 0
How do I subset the df for unique TIME (i.e. get rid of the repeated time point that has AMT=0)? To make it clearer: I want to remove duplicate TIME rows that have AMT=0.
It is not entirely clear what you're asking. I think what you want is, for each unique ID value, eliminate duplicate TIME rows, and if a duplicate row has AMT=0, prefer to delete that row rather than another duplicate (with the same TIME value) that has AMT!=0.
The best way to do this is actually to call aggregate(), and group by both ID and TIME, taking the max() of all the AMT values in all the duplicates in a group (thus this will work for duplicate groups that have more than two rows, if such existed):
df <- data.frame(id=c(1,1,1,1,1,2,2,2,2,2), time=c(0,1,2,2,3,0,1,2,2,3), amt=c(100,0,0,50,0,50,0,0,100,0) );
df;
## id time amt
## 1 1 0 100
## 2 1 1 0
## 3 1 2 0
## 4 1 2 50
## 5 1 3 0
## 6 2 0 50
## 7 2 1 0
## 8 2 2 0
## 9 2 2 100
## 10 2 3 0
aggregate(amt~id+time, df, max );
## id time amt
## 1 1 0 100
## 2 2 0 50
## 3 1 1 0
## 4 2 1 0
## 5 1 2 50
## 6 2 2 100
## 7 1 3 0
## 8 2 3 0
As you can see, the order got a little messed up, but you could easily fix that with a call to order() afterward:
df2 <- aggregate(amt~id+time, df, max );
df2[order(df2$id,df2$time),];
## id time amt
## 1 1 0 100
## 3 1 1 0
## 5 1 2 50
## 7 1 3 0
## 2 2 0 50
## 4 2 1 0
## 6 2 2 100
## 8 2 3 0
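If the real data has more columns than id, time and amt, aggregate() would drop them. A possible alternative, sketched here under the same "keep the largest amt per id/time" assumption, is to sort so the largest amt comes first within each pair and then drop the later duplicates. The names ordered and deduped are made up for illustration.

```r
df <- data.frame(
  id   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  time = c(0, 1, 2, 2, 3, 0, 1, 2, 2, 3),
  amt  = c(100, 0, 0, 50, 0, 50, 0, 0, 100, 0)
)

# Sort by id and time, with the largest amt first inside each (id, time) pair
ordered <- df[order(df$id, df$time, -df$amt), ]

# Keep only the first row of each (id, time) pair, i.e. the one with the largest amt
deduped <- ordered[!duplicated(ordered[c("id", "time")]), ]
deduped
```

This keeps whole rows, so any additional columns travel along with the row that is retained.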
It is not entirely clear from the description how the duplicated elements should be removed. Suppose there are duplicates for 'TIME' and 'ID' where the 'AMT' value is neither zero nor the maximum. If we only need to remove the '0' values per combination (keeping one row when all values are 0):
library(data.table)
res1 <- setDT(df1)[, if(all(AMT==0)) .SD[1L] else .SD[AMT!=0], list(TIME,ID)]
res1[order(TIME)]
# TIME ID AMT
#1: 0 1 100
#2: 0 2 50
#3: 1 1 0
#4: 1 2 0
#5: 2 1 50
#6: 2 2 100
#7: 3 1 0
#8: 3 2 0
or, if the idea of removing the duplicates was as assumed by @bgoldst, an equivalent option using data.table is
res2 <- setDT(df1)[, list(amt=max(AMT)), list(TIME, ID)]
res2[order(TIME)]
# TIME ID amt
#1: 0 1 100
#2: 0 2 50
#3: 1 1 0
#4: 1 2 0
#5: 2 1 50
#6: 2 2 100
#7: 3 1 0
#8: 3 2 0
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
TIME = c(0L, 1L, 2L, 2L, 3L, 0L, 1L, 2L, 2L, 3L), AMT = c(100L,
0L, 0L, 50L, 0L, 50L, 0L, 0L, 100L, 0L)), .Names = c("ID",
"TIME", "AMT"), class = "data.frame", row.names = c(NA, -10L))

R: adding in rows of zero based on the values in multiple columns

I am trying to append rows to an R data.frame. Here is an example of a data.frame "foo":
A B C D
1 1 1 200
1 1 2 50
1 1 3 15
1 2 1 150
1 2 4 50
1 3 1 300
2 1 2 40
2 1 4 90
2 3 2 80
For every A, there are 3 possible values of B, and for every B, there are 4 possible values of C. However, the initial df only contains rows where D is non-zero. I'd like to manipulate the df so that the missing combinations of B and C are included, with D set to 0. I have seen questions that address this with one column, but couldn't find one addressing it with multiple columns. The final df would look like this:
A B C D
1 1 1 200
1 1 2 50
1 1 3 15
1 1 4 0
1 2 1 150
1 2 2 0
1 2 3 0
1 2 4 50
1 3 1 300
1 3 2 0
1 3 3 0
1 3 4 0
2 1 1 0
2 1 2 40
2 1 3 0
2 1 4 90
2 2 1 0
2 2 2 0
2 2 3 0
2 2 4 0
2 3 1 0
2 3 2 80
2 3 3 0
2 3 4 0
I first tried creating a dummy data frame that then merged with the initial df, but something isn't working right. Here's the current code, which I know is wrong because this code only generates rows based on A. I think I want to make the dummy frame based on A and B but I don't know how - could an if/else function work here?:
# create dummy df
dummy <- as.data.frame(
cbind(
sort(rep(unique(foo$A), 12)),
rep(1:3,length(unique(foo$A)))))
colnames(dummy) <- c("A","B")
foo$A <- as.numeric(foo$A)
foo$B <- as.numeric(foo$C)
# merge with foo
mergedummy <- merge(dummy,foo,all.x=T)
Any insight is greatly appreciated - thanks!
A one liner:
merge(dat, data.frame(table(dat[1:3]))[-4],all.y=TRUE)
# A B C D
#1 1 1 1 200
#2 1 1 2 50
#3 1 1 3 15
#4 1 1 4 NA
#...
Or maybe less complicated:
out <- data.frame(xtabs(D ~ ., data=dat))
out[do.call(order,out[1:3]),]
# A B C Freq
#1 1 1 1 200
#7 1 1 2 50
#13 1 1 3 15
#19 1 1 4 0
#...
Where dat is:
dat <- structure(list(A = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), B = c(1L,
1L, 1L, 2L, 2L, 3L, 1L, 1L, 3L), C = c(1L, 2L, 3L, 1L, 4L, 1L,
2L, 4L, 2L), D = c(200L, 50L, 15L, 150L, 50L, 300L, 40L, 90L,
80L)), .Names = c("A", "B", "C", "D"), class = "data.frame", row.names = c(NA,
-9L))
I created a master data frame which includes all combinations of A, B, and C as you describe in the expected outcome. Then, I merged the master data frame with your data frame. Finally, I replaced NA with 0.
master <- data.frame(A = rep(1:2, each = 12),
B = rep(1:3, each = 4),
C = rep(1:4, times = 6))
library(dplyr)
master %>%
left_join(., mydf) %>%
mutate(D = ifelse(D %in% NA, 0, D))
# A B C D
#1 1 1 1 200
#2 1 1 2 50
#3 1 1 3 15
#4 1 1 4 0
#5 1 2 1 150
#6 1 2 2 0
#7 1 2 3 0
#8 1 2 4 50
#9 1 3 1 300
#10 1 3 2 0
#11 1 3 3 0
#12 1 3 4 0
#13 2 1 1 0
#14 2 1 2 40
#15 2 1 3 0
#16 2 1 4 90
#17 2 2 1 0
#18 2 2 2 0
#19 2 2 3 0
#20 2 2 4 0
#21 2 3 1 0
#22 2 3 2 80
#23 2 3 3 0
#24 2 3 4 0
Here is one solution:
foo <- merge(expand.grid(lapply(foo[,1:3], unique)), foo, all=TRUE, sort=TRUE)
foo[is.na(foo)] <- 0
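Note that building the grid from unique() only covers values that actually occur somewhere in the data. If a value of B or C could be absent everywhere, a slightly more defensive sketch is to spell out the ranges yourself. The ranges 1:3 and 1:4 below are taken from the question's description, and the names master and out are made up.

```r
foo <- data.frame(
  A = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  B = c(1L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 3L),
  C = c(1L, 2L, 3L, 1L, 4L, 1L, 2L, 4L, 2L),
  D = c(200L, 50L, 15L, 150L, 50L, 300L, 40L, 90L, 80L)
)

# Every combination of the observed A values with B in 1:3 and C in 1:4
master <- expand.grid(A = sort(unique(foo$A)), B = 1:3, C = 1:4)

# Left-join the observed rows onto the grid, then fill missing D with 0
out <- merge(master, foo, all.x = TRUE)
out$D[is.na(out$D)] <- 0L
out <- out[order(out$A, out$B, out$C), ]
out
```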
