R: ddply and return string

I have a dataframe like this:
id col1
1 1 1
2 2 2
3 3 3
4 4 4
5 5 1
6 1 2
7 2 3
8 3 4
I would like to group by id, then create a string that contains the values in col1 separated by spaces and sorted in descending order.
I first order the data frame by id and col1 but am unable to get the output from ddply as a string with no quotes.
df111 <- df111[order(df111$id, -df111$col1),]
df222 <- ddply(df111, .(id), function(col1) as.character(paste0(col1,sep = ' ')))
id V1 V2
1 1 c(1, 1, 1, 1) c(0.793507214868441, 0.539258575299755, 0.165128685068339, 0.153290810529143)
2 2 c(2, 2, 2, 2) c(0.872032727580518, 0.827515688957646, 0.236087603960186, 0.165240615839139)
3 3 c(3, 3, 3, 3) c(0.759382889838889, 0.484359077410772, 0.182580581633374, 0.0723447729833424)
4 4 c(4, 4, 4, 4) c(0.874859027564526, 0.642130059422925, 0.0569298807531595, 0.0227038362063468)
5 5 c(5, 5, 5, 5) c(0.392553070792928, 0.386064056074247, 0.299609177513048, 0.222290486795828)
I'd like something like this:
id V1
1 1 0.793507214868441 0.539258575299755 0.165128685068339 0.153290810529143
Any suggestions?
EDIT:
> dput(df111)
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L), col1 = c(0.793507214868441,
0.539258575299755, 0.165128685068339, 0.153290810529143, 0.872032727580518,
0.827515688957646, 0.236087603960186, 0.165240615839139, 0.759382889838889,
0.484359077410772, 0.182580581633374, 0.0723447729833424, 0.874859027564526,
0.642130059422925, 0.0569298807531595, 0.0227038362063468, 0.392553070792928,
0.386064056074247, 0.299609177513048, 0.222290486795828)), .Names = c("id",
"col1"), row.names = c(1L, 11L, 16L, 6L, 7L, 12L, 17L, 2L, 18L,
13L, 8L, 3L, 14L, 9L, 19L, 4L, 20L, 10L, 5L, 15L), class = "data.frame")

I think maybe you just need to use summarise rather than a custom anonymous function...?
dat <- read.table(text = "id col1
1 1 1
2 2 2
3 3 3
4 4 4
5 5 1
6 1 2
7 2 3
8 3 4",header = TRUE,sep = "")
> ddply(dat,.(id),summarise,val = paste(sort(col1,decreasing = TRUE),collapse = " "))
id val
1 1 2 1
2 2 3 2
3 3 4 3
4 4 4
5 5 1
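For comparison, here is a minimal base R sketch of the same idea, assuming the small example data above. It also hints at why the original attempt misbehaved: ddply passes each group's whole sub-data-frame to the function, so an argument named col1 does not receive only that column. The names res and val are just illustrative.

```r
# Example data from the question (ids 1-5, some repeated)
dat <- data.frame(id   = c(1, 2, 3, 4, 5, 1, 2, 3),
                  col1 = c(1, 2, 3, 4, 1, 2, 3, 4))

# For each id, sort col1 in decreasing order and collapse it
# into a single space-separated string
res <- aggregate(col1 ~ id, data = dat,
                 FUN = function(x) paste(sort(x, decreasing = TRUE),
                                         collapse = " "))
names(res)[2] <- "val"
res
```

The val column is an ordinary character vector; any quotes you see depend only on how it is printed, not on the stored values.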


New data frame, if specific value(s) is contained AND other values aren't included in a range of columns in r

So, I have a large data frame with monthly observations of n individuals.
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
A 33 6 1 2 1 5
B 36 5 0 2 1 5
C 22 4 1 NA 1 5
D 2 2 0 2 1 5
E 5 2 1 2 1 6
F 7 1 0 2 1 5
G 8 6 1 2 1 5
H 2 8 0 2 2 5
I 1 3 1 2 1 5
J 3 2 0 2 1 5
I want to create a new data frame that includes only the individuals who meet some specific conditions.
E.g. if, for individual i, the columns y_0101:y_0312 do NOT include the values 3, 6, or NA, AND do include the value 2 or 1, then individual i should be included in the new data frame. This produces the following data frame:
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
B 36 5 0 2 1 5
D 2 2 0 2 1 5
F 7 1 0 2 1 5
H 2 8 0 2 2 5
I tried different ways, but I can't figure out how to get multiple conditions included.
df <- df %>% filter(vars(starts_with("y_"))!=3 | !=6 | != NA)
or
df <- df %>% filter_at(vars(starts_with("y_")), all_vars(!=3 | !=6 | != NA)
I've tried some other things as well, like !%in%, but that doesn't seem to work. Any ideas?
I think you're almost there, but might need a slight shift in the logic:
df <- data.frame(A1 = 1:10,
A2 = 10:1,
A3 = 1:10,
B1 = 1:10)
df %>%
filter_at(vars(starts_with("A")), ~!(.x %in% c(3, 6, NA))) %>%
filter(if_any(starts_with("A"), ~ .x %in% c(1, 2)))
In the first step, I filter out all rows where any of the columns are 3, 6, or NA. In the second step, I filter down to only rows where at least one of the columns is 1 or 2. Does this help with your case?
Here is a base R option using rowSums :
cols <- grep('y_', names(df))
include <- c(1, 2)
not_include <- c(3, 6, NA)
result <- subset(df, rowSums(sapply(df[cols], `%in%`, include)) > 0 &
rowSums(sapply(df[cols], `%in%`, not_include)) == 0)
result
# ind y_0101 y_0102 y_0103 y_0104 y_0311 y_0312
#2 B 36 5 0 2 1 5
#4 D 2 2 0 2 1 5
#6 F 7 1 0 2 1 5
#8 H 2 8 0 2 2 5
data
df <- structure(list(ind = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J"), y_0101 = c(33L, 36L, 22L, 2L, 5L, 7L, 8L, 2L, 1L,
3L), y_0102 = c(6L, 5L, 4L, 2L, 2L, 1L, 6L, 8L, 3L, 2L), y_0103 = c(1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L), y_0104 = c(2L, 2L, NA, 2L,
2L, 2L, 2L, 2L, 2L, 2L), y_0311 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L), y_0312 = c(5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L
)), class = "data.frame", row.names = c(NA, -10L))
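A small sketch of why `%in%` works here while `!= NA` does not: comparing with NA yields NA, whereas `%in%` matches NA like any other value. The three-row data frame below is a cut-down, illustrative subset of the question's data, and has_bad / has_good are hypothetical helper names.

```r
x <- c(2, NA, 3)
x != 3                  # TRUE NA FALSE: the NA propagates
x %in% c(3, 6, NA)      # FALSE TRUE TRUE: NA is matched explicitly

# Cut-down version of the question's data (rows A, B, C only)
df <- data.frame(ind    = c("A", "B", "C"),
                 y_0102 = c(6L, 5L, 4L),
                 y_0104 = c(2L, 2L, NA),
                 y_0311 = c(1L, 1L, 1L))
cols <- grep("^y_", names(df))

# Row-wise flags: any forbidden value, any required value
has_bad  <- apply(df[cols], 1, function(r) any(r %in% c(3, 6, NA)))
has_good <- apply(df[cols], 1, function(r) any(r %in% c(1, 2)))

result <- df[has_good & !has_bad, ]
result
```

Only row B survives: A contains a 6 and C contains an NA.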

Choose rows in which the absolute value of subtraction is less than a specified value

Let's say I have this dataframe:
ID X1 X2
1 1 2
2 2 1
3 3 1
4 4 1
5 5 5
6 6 20
7 7 20
8 9 20
9 10 20
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
And I want to select rows in which the absolute value of the difference between columns X1 and X2 is greater than or equal to 2.
For example, row 4 value is 4-1, which is 3 and should be selected.
Row 9 value is 10-20, which is -10. Absolute value is 10 and should be selected.
In this case it would be rows 3, 4, 6, 7, 8 and 9
I tried:
dataset2 = dataset[,abs(dataset- c(dataset[,2])) > 2]
But I get an error.
The operation:
abs(dataset- c(dataset[,2])) > 2
does flag values that are more than 2, but the result only works for my second column and does not select the rows properly.
We can take the difference between the X1 and X2 columns and use it as a logical expression in subset() to select the rows:
subset(dataset, abs(X1 - X2) >= 2)
# ID X1 X2
#3 3 3 1
#4 4 4 1
#6 6 6 20
#7 7 7 20
#8 8 9 20
#9 9 10 20
Or using index
subset(dataset, abs(dataset[[2]] - dataset[[3]]) >= 2)
data
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
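As a sketch of the same selection with plain bracket indexing: the logical vector goes before the comma (the row position), whereas the attempt in the question placed its condition after the comma, where columns are selected. The name keep is illustrative.

```r
dataset <- data.frame(ID = 1:9,
                      X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L),
                      X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L))

# One TRUE/FALSE per row: is |X1 - X2| at least 2?
keep <- abs(dataset$X1 - dataset$X2) >= 2

# Logical index in the row position, empty column position
dataset2 <- dataset[keep, ]
dataset2
```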

Is there any R function to make this happen?

Hi, this is an Excel version of the data I want to be able to create in R.
Just to make it clear: every time an observation date has 5 in its Group column, I need the column Group_fix to equal 5 for the following 12-month observation period.
How can this be done in R? Can we use the ifelse function?
Here is an approach with lag from dplyr.
library(dplyr)
data %>%
mutate(GroupFix = case_when(Group == 5 |
lag(Group,1) == 5 |
lag(Group,2) == 5 |
lag(Group,3) == 5 |
lag(Group,4) == 5 |
lag(Group,5) == 5 |
lag(Group,6) == 5 |
lag(Group,7) == 5 |
lag(Group,8) == 5 |
lag(Group,9) == 5 |
lag(Group,10) == 5 |
lag(Group,11) == 5 ~ 5,
TRUE ~ as.numeric(Group)))
Observation.Date Group GroupFix
1 12/31/19 1 1
2 1/31/20 2 2
3 2/29/20 2 2
4 3/31/20 2 2
5 4/30/20 3 3
6 5/31/20 4 4
7 6/30/20 5 5
8 7/31/20 5 5
9 8/31/20 4 5
10 9/30/20 3 5
11 10/31/20 2 5
12 11/30/20 3 5
13 12/31/20 4 5
14 1/31/21 5 5
15 2/28/21 5 5
16 3/31/21 4 5
17 4/30/21 3 5
18 5/31/21 2 5
19 6/30/21 1 5
20 7/31/21 1 5
21 8/31/21 1 5
22 9/30/21 1 5
23 10/31/21 1 5
24 11/30/21 1 5
25 12/31/21 1 5
26 1/31/22 1 5
27 2/28/22 1 1
Data
data <- structure(list(Observation.Date = structure(c(8L, 1L, 13L, 14L,
16L, 18L, 20L, 22L, 24L, 26L, 4L, 6L, 9L, 2L, 11L, 15L, 17L,
19L, 21L, 23L, 25L, 27L, 5L, 7L, 10L, 3L, 12L), .Label = c("1/31/20",
"1/31/21", "1/31/22", "10/31/20", "10/31/21", "11/30/20", "11/30/21",
"12/31/19", "12/31/20", "12/31/21", "2/28/21", "2/28/22", "2/29/20",
"3/31/20", "3/31/21", "4/30/20", "4/30/21", "5/31/20", "5/31/21",
"6/30/20", "6/30/21", "7/31/20", "7/31/21", "8/31/20", "8/31/21",
"9/30/20", "9/30/21"), class = "factor"), Group = c(1L, 2L, 2L,
2L, 3L, 4L, 5L, 5L, 4L, 3L, 2L, 3L, 4L, 5L, 5L, 4L, 3L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-27L))
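To address the ifelse part of the question, here is a base R sketch of the same rolling check without listing every lag. It assumes the rows are sorted by date with one observation per month, and that the 12-month period counts the month containing the 5 plus the 11 following ones, as in the output above. The names window, hit and group_fix are illustrative.

```r
# Group column from the example data, in date order
group <- c(1L, 2L, 2L, 2L, 3L, 4L, 5L, 5L, 4L, 3L, 2L, 3L, 4L, 5L,
           5L, 4L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)

window <- 12          # current row plus the 11 rows before it
is5 <- group == 5

# For each row i, check whether any of rows (i - 11)..i has Group == 5
hit <- sapply(seq_along(group),
              function(i) any(is5[max(1, i - window + 1):i]))

group_fix <- ifelse(hit, 5, group)
group_fix
```

Changing window adjusts the length of the period without touching the rest of the code.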

Finding difference between specific rows by group

Within a group, I want to find the difference between that row and the first time that user appeared in the data. For example, I need to create the diff variable below. Users have different number of rows each as in the following data:
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L),
money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L), diff = c(NA, 3L, 4L,
6L, NA, 2L, 3L, NA, NA, 8L)), .Names = c("ID", "money", "occurence",
"diff"), class = "data.frame", row.names = c(NA, -10L))
ID money occurence diff
1 1 9 1 NA
2 1 12 2 3
3 1 13 3 4
4 1 15 4 6
5 2 5 1 NA
6 2 7 2 2
7 2 8 3 3
8 3 5 1 NA
9 4 2 1 NA
10 4 10 2 8
You can use ave(). We just remove the first value per group and replace it with NA, and subtract the first value from the rest of the values.
with(df, ave(money, ID, FUN = function(x) c(NA, x[-1] - x[1])))
# [1] NA 3 4 6 NA 2 3 NA NA 8
A dplyr solution, which uses the first function to get the first value and calculate the difference.
library(dplyr)
df2 <- df %>%
group_by(ID) %>%
mutate(diff = money - first(money)) %>%
mutate(diff = replace(diff, diff == 0, NA)) %>%
ungroup()
df2
# # A tibble: 10 x 4
# ID money occurence diff
# <int> <int> <int> <int>
# 1 1 9 1 NA
# 2 1 12 2 3
# 3 1 13 3 4
# 4 1 15 4 6
# 5 2 5 1 NA
# 6 2 7 2 2
# 7 2 8 3 3
# 8 3 5 1 NA
# 9 4 2 1 NA
# 10 4 10 2 8
Update
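Here is a data.table solution provided by Sotos. Notice that there is no step replacing 0 with NA: as written, the code overwrites money in place, so each group's first row shows 0 in that column.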
library(data.table)
setDT(df)[, money := money - first(money), by = ID][]
# ID money occurence diff
# 1: 1 0 1 NA
# 2: 1 3 2 3
# 3: 1 4 3 4
# 4: 1 6 4 6
# 5: 2 0 1 NA
# 6: 2 2 2 2
# 7: 2 3 3 3
# 8: 3 0 1 NA
# 9: 4 0 1 NA
# 10: 4 8 2 8
DATA
dput(df)
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L),
money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L), occurence = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 1L, 2L)), .Names = c("ID", "money",
"occurence"), row.names = c(NA, -10L), class = "data.frame")
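One caveat with replace(diff, diff == 0, NA): a later row whose money happens to equal the first value would also become NA. A base R sketch that marks the first row of each ID by position instead, assuming the rows are already ordered within each ID:

```r
df <- data.frame(ID    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L),
                 money = c(9L, 12L, 13L, 15L, 5L, 7L, 8L, 5L, 2L, 10L))

# Subtract each group's first money value from every row in the group
df$diff <- with(df, money - ave(money, ID, FUN = function(x) x[1]))

# The first occurrence of each ID is set to NA by position, not by value
df$diff[!duplicated(df$ID)] <- NA
df
```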

Remove duplicated 2 columns permutations

I can't find a good title for this question so feel free to edit it please.
I have this data.frame
section time to from
1 a 9 1 2
2 a 9 2 1
3 a 12 2 3
4 a 12 2 4
5 a 12 3 2
6 a 12 3 4
7 a 12 4 2
8 a 12 4 3
I want to remove rows that duplicate the same to and from pair, regardless of the order of the two columns: e.g. (1,2) and (2,1) count as duplicates.
So final output would be:
section time to from
1 a 9 1 2
3 a 12 2 3
4 a 12 2 4
6 a 12 3 4
I have a solution that constructs a new key column, e.g.
key <- paste(min(to,from),max(to,from))
and removes duplicated keys using duplicated, but I think this is a dirty solution.
Here is the dput of my data:
structure(list(section = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "a", class = "factor"), time = c(9L, 9L, 12L,
12L, 12L, 12L, 12L, 12L), to = c(1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L), from = c(2L, 1L, 3L, 4L, 2L, 4L, 2L, 3L)), .Names = c("section",
"time", "to", "from"), row.names = c(NA, -8L), class = "data.frame")
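With the data frame stored in s, take the row-wise minimum and maximum of each pair, so that (1,2) and (2,1) map to the same key:
mn <- pmin(s$to, s$from)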
mx <- pmax(s$to, s$from)
int <- as.numeric(interaction(mn, mx))
s[match(unique(int), int),]
section time to from
1 a 9 1 2
3 a 12 2 3
4 a 12 2 4
6 a 12 3 4
Credit for the idea goes to this question: Remove consecutive duplicates from dataframe and specifically @MatthewPlourde's answer.
You can try using sort within the apply function to order the combinations.
mydf[!duplicated(t(apply(mydf[3:4], 1, sort))), ]
# section time to from
# 1 a 9 1 2
# 3 a 12 2 3
# 4 a 12 2 4
# 6 a 12 3 4
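The key idea from the question can also be written with the vectorized pmin() and pmax() (plain min() and max() would collapse the whole column to a single value). This is a sketch that, like the answers above, looks only at to and from and ignores section and time; the names lo and hi are illustrative.

```r
mydf <- data.frame(section = "a",
                   time = c(9L, 9L, 12L, 12L, 12L, 12L, 12L, 12L),
                   to   = c(1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L),
                   from = c(2L, 1L, 3L, 4L, 2L, 4L, 2L, 3L))

# Order each pair so that (1,2) and (2,1) both become (1,2)
lo <- pmin(mydf$to, mydf$from)
hi <- pmax(mydf$to, mydf$from)

# Keep the first row of each ordered pair
result <- mydf[!duplicated(data.frame(lo, hi)), ]
result
```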
