I am slowly working through a data transformation using R and the dplyr package. I started with unique rows per respondent. The data come from a conjoint experiment, so I need to move toward profiles (profile A or B in the experiment) nested in experimental iterations (each respondent took the experiment 5 times) nested in respondent ID.
I have successfully transformed the data so that experiments are nested within respondent IDs. Now I have multiple columns X1-Xn that contain the attribute features. However, these effectively repeat attributes at this point: X1, say, holds a variable for profile A in the experiment, and X6 holds the same variable for profile B.
In the example below, I basically need to merge columns v1a and v1b into just v1, v2a and v2b into just v2, and so forth, while generating a new column that indicates whether each value comes from a or b.
Following up on the comments and the helpful (but not quite what was needed) engagement with the original post, I have edited it to include simple code for both the original data structure and the ideal outcome:
#original dataframe
ID <- c(1, 1, 1, 2, 2, 2)
Ex_ID <- c(1, 2, 3, 1, 2, 3)
v1a <- c(2, 4, 5, 1, 3, 5)
v2a <- c(3, 4, 5, 2, 1, 5)
v3a <- c(5, 4, 3, 3, 2, 1)
v1b <- c(4, 5, 5, 1, 5, 4)
v2b <- c(5, 2, 2, 4, 1, 4)
v3b <- c(5, 5, 4, 5, 4, 5)
original <- data.frame(ID, Ex_ID, v1a, v2a, v3a, v1b, v2b, v3b)
#wanted data frame
ID <- c(1, 1, 1, 1, 1, 1)
Ex_ID <- c(1, 1, 2, 2, 3, 3)
profile <- c("a", "b", "a", "b", "a", "b")
v1ab <- c(2, 4, 4, 5, 5, 5)
v2ab <- c(3, 5, 4, 2, 5, 2)
v3ab <- c(5, 5, 4, 5, 3, 4)
desired <- data.frame(ID, Ex_ID, profile, v1ab, v2ab, v3ab)
I basically want to find a way to nest multiple variables within ID, experiment ID, profile IDs.
Any guidance would be greatly appreciated.
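For reference, the whole reshape from original to something like desired can be done in one step with tidyr's pivot_longer(); this is a sketch assuming the attribute columns all follow the v&lt;number&gt;&lt;a/b&gt; naming pattern shown above (the data are re-created here so the snippet is self-contained):

```r
library(tidyr)

# Re-create the original data frame from the question
original <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  Ex_ID = c(1, 2, 3, 1, 2, 3),
  v1a = c(2, 4, 5, 1, 3, 5),
  v2a = c(3, 4, 5, 2, 1, 5),
  v3a = c(5, 4, 3, 3, 2, 1),
  v1b = c(4, 5, 5, 1, 5, 4),
  v2b = c(5, 2, 2, 4, 1, 4),
  v3b = c(5, 5, 4, 5, 4, 5)
)

# ".value" keeps the attribute stem (v1, v2, v3) as the column name,
# while the trailing letter becomes the new profile column
long <- pivot_longer(
  original,
  cols = v1a:v3b,
  names_to = c(".value", "profile"),
  names_pattern = "(v\\d)([ab])"
)
```

Each ID/Ex_ID pair now contributes two rows, one per profile, with columns v1, v2, v3.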
Let's take a look at a minimal working example.
df<-data.frame(ID=c(1,1,1,2,2,3),v1a=c(2,4,5,1,3,5),v1b=c(4,5,5,1,5,4))
To merge the columns v1a and v1b we can use paste(), which concatenates strings. The new column is created with mutate(), which comes with the dplyr package.
library(dplyr)
df <- mutate(df, v1 = paste(v1a, v1b, sep = ","))
Result:
ID v1a v1b v1
1 1 2 4 2,4
2 1 4 5 4,5
3 1 5 5 5,5
4 2 1 1 1,1
5 2 3 5 3,5
6 3 5 4 5,4
If you want to get rid of the "old" columns v1a and v1b, you can use select
df <- select(df, -c(v1a, v1b))
which results in
ID v1
1 1 2,4
2 1 4,5
3 1 5,5
4 2 1,1
5 2 3,5
6 3 5,4
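tidyr's unite() combines the paste and the column removal into one step; a sketch on the same minimal example (the result matches the mutate-then-select version above):

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c(1, 1, 1, 2, 2, 3),
                 v1a = c(2, 4, 5, 1, 3, 5),
                 v1b = c(4, 5, 5, 1, 5, 4))

# unite() pastes v1a and v1b into v1 and drops the originals by default
df <- df %>% unite("v1", v1a, v1b, sep = ",")
```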
We could do this with base R using sapply:
# group the value columns by their two-character stem (v1, v2, ...)
cols <- split(names(df)[-c(1, 2)], substr(names(df)[-c(1, 2)], start = 1, stop = 2))
# paste the columns within each stem together, comma-separated
cbind(df[c(1, 2)], sapply(names(cols), function(col) {
  do.call(paste, c(df[cols[[col]]], sep = ","))
}))
Output:
ID Ex_ID v1 v2
1 1 1 2,4 3,5
2 1 2 4,5 4,2
3 1 3 5,5 5,2
4 2 1 1,1 2,4
5 2 2 3,5 1,1
6 2 3 5,4 5,4
7 3 1 4,4 2,5
8 3 2 1,1 5,4
9 3 3 4,5 1,2
data:
library(tibble)
df <- tibble(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Ex_ID = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  v1a = c(2, 4, 5, 1, 3, 5, 4, 1, 4),
  v2a = c(3, 4, 5, 2, 1, 5, 2, 5, 1),
  v1b = c(4, 5, 5, 1, 5, 4, 4, 1, 5),
  v2b = c(5, 2, 2, 4, 1, 4, 5, 4, 2)
)
I have a dataframe that looks like this:
ID <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)
State <- c('AZ', 'IA', 'MN', 'NY', 'IL', 'FL', 'TX', 'TN', 'LA', 'ND')
ABC1_current <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1)
ABC2_current <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4)
ABC1_future <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3)
ABC2_future <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4)
df <- data.frame(ID, State, ABC1_current, ABC2_current, ABC1_future, ABC2_future)
I am trying to dynamically place each column with the _future suffix to the right of the column with the _current suffix for the same prefix (ABC1, ABC2, etc.). The ID and State columns don't move at all. Here's what I am hoping to get as a result:
df2 <- data.frame(ID, State, ABC1_current, ABC1_future, ABC2_current, ABC2_future)
Is there a way to interlace columns like this dynamically? Ideally, I'd like to use dplyr if possible.
Although not purely dplyr, this may help.
It takes advantage of the alphabetical ordering of the column names (current sorts before future within each prefix).
library(dplyr)
df1 <- df %>% select(ID, State)
df2 <- df %>% select(-c(ID, State))
index <- sort(colnames(df2))
df3 <- cbind(df1, df2[index])
df3
     ID State ABC1_current ABC1_future ABC2_current ABC2_future
1   1.1    AZ            1           2            4           2
2   1.2    IA            5           5            5           5
3   1.3    MN            3           3            5           5
4   1.4    NY            4           5            4           1
5   1.5    IL            3           3            2           2
6   1.6    FL            4           4            5           1
7   1.7    TX            5           5            5           5
8   1.8    TN            2           3            1           1
9   1.9    LA            2           2            2           3
10  2.0    ND            1           3            4           4
Basically the same idea as #PeteKittinun's:
library(dplyr)
df %>%
select(ID, State, sort(colnames(.)[3:ncol(.)]))
returns
    ID State ABC1_current ABC1_future ABC2_current ABC2_future
1  1.1    AZ            1           2            4           2
2  1.2    IA            5           5            5           5
3  1.3    MN            3           3            5           5
4  1.4    NY            4           5            4           1
5  1.5    IL            3           3            2           2
6  1.6    FL            4           4            5           1
7  1.7    TX            5           5            5           5
8  1.8    TN            2           3            1           1
9  1.9    LA            2           2            2           3
10 2.0    ND            1           3            4           4
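The same reordering also works without dplyr, since plain sort() on the names puts _current before _future within each prefix; a sketch assuming the first two columns are always ID and State (shortened data for illustration):

```r
df <- data.frame(
  ID = c(1.1, 1.2), State = c("AZ", "IA"),
  ABC1_current = c(1, 5), ABC2_current = c(4, 5),
  ABC1_future = c(2, 5), ABC2_future = c(2, 5)
)

# Keep ID and State in place, sort the remaining column names alphabetically
df2 <- df[c(names(df)[1:2], sort(names(df)[-(1:2)]))]
```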
I am doing a statistical analysis on a big data frame (more than 48,000,000 rows) in R. Here is an example of the data:
structure(list(herd = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), cows = c(1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16), `date` = c("11/03/2013",
"12/03/2013", "13/03/2013", "14/03/2013", "15/03/2013", "16/03/2013",
"13/05/2012", "14/05/2012", "15/05/2012", "16/05/2012", "17/05/2012",
"18/05/2012", "10/07/2016", "11/07/2016", "12/07/2016", "13/07/2016",
"11/03/2013", "12/03/2013", "13/03/2013", "14/03/2013", "15/03/2013",
"16/03/2013", "13/05/2012", "14/05/2012", "15/05/2012", "16/05/2012",
"17/05/2012", "18/05/2012", "10/07/2016", "11/07/2016", "12/07/2016",
"13/07/2016", "11/03/2013", "12/03/2013", "13/03/2013", "14/03/2013",
"15/03/2013", "16/03/2013", "13/05/2012", "14/05/2012", "15/05/2012",
"16/05/2012", "17/05/2012", "18/05/2012", "10/07/2016", "11/07/2016",
"12/07/2016", "13/07/2016"), glicose = c(240666, 23457789, 45688688,
679, 76564, 6574553, 78654, 546432, 76455643, 6876, 7645432,
876875, 98654, 453437, 98676, 9887554, 76543, 9775643, 986545,
240666, 23457789, 45688688, 679, 76564, 6574553, 78654, 546432,
76455643, 6876, 7645432, 876875, 98654, 453437, 98676, 9887554,
76543, 9775643, 986545, 240666, 23457789, 45688688, 679, 76564,
6574553, 78654, 546432, 76455643, 6876)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -48L))
I need to identify how many cows are in the following categories of glicose, by herd and by date:
<=100000
>100000 and <=150000
>150000 and <=200000
>200000 and <=250000
>250000 and <=400000
>400000
I tried to use the functions filter() and select(), but I could not categorize the variable like that.
I also tried to make a vector for each category, but it did not work:
ht <- df %>% group_by(herd, date) %>%
  filter(glicose < 100000)
Actually I do not have a clue how to do this. Please help!
I expect to get the number of cows in each category of each herd, based on each date, in a table.
Calling your data df,
df %>%
mutate(glicose_group = cut(glicose, breaks = c(0, seq(1e5, 2.5e5, by = 0.5e5), 4e5, Inf)),
date = as.Date(date, format = "%d/%m/%Y")) %>%
group_by(herd, date, glicose_group) %>%
count
# # A tibble: 48 x 4
# # Groups: herd, date, glicose_group [48]
# herd date glicose_group n
# <dbl> <date> <fct> <int>
# 1 1 2012-05-13 (0,1e+05] 1
# 2 1 2012-05-14 (4e+05,Inf] 1
# 3 1 2012-05-15 (4e+05,Inf] 1
# 4 1 2012-05-16 (0,1e+05] 1
# 5 1 2012-05-17 (4e+05,Inf] 1
# 6 1 2012-05-18 (4e+05,Inf] 1
# 7 1 2013-03-11 (2e+05,2.5e+05] 1
# 8 1 2013-03-12 (4e+05,Inf] 1
# 9 1 2013-03-13 (4e+05,Inf] 1
# 10 1 2013-03-14 (0,1e+05] 1
# # ... with 38 more rows
I also threw in a conversion to Date class, which is probably a good idea.
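The same counts can be produced in base R with cut() and table(); a sketch on a small slice of the data (the break points mirror the categories above):

```r
# A small slice of the data for illustration
df <- data.frame(
  herd = c(1, 1, 2, 2),
  date = c("11/03/2013", "12/03/2013", "11/03/2013", "12/03/2013"),
  glicose = c(240666, 23457789, 76543, 9775643)
)

# Bin glicose into the six categories
breaks <- c(0, 1e5, 1.5e5, 2e5, 2.5e5, 4e5, Inf)
df$glicose_group <- cut(df$glicose, breaks = breaks)

# Count cows per herd/date/category combination
counts <- as.data.frame(table(df$herd, df$date, df$glicose_group))
names(counts) <- c("herd", "date", "glicose_group", "n")
```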
I want to expand a vector of integers into consecutive integers within each group in R. Does anyone have hints on this problem?
Below is my original dataset:
x = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8)
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
data = data.frame(x, group)
and my desired dataset is as below:
desired_data = data.frame(
x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8),
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3))
Thanks for your help!
This can easily be done via expand() from tidyr:
library(tidyverse)
data %>%
  group_by(group) %>%
  expand(x = full_seq(x, 1))
Which gives,
# A tibble: 19 x 2
# Groups: group [3]
group x
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5
11 2 6
12 3 1
13 3 2
14 3 3
15 3 4
16 3 5
17 3 6
18 3 7
19 3 8
I'm sure someone will have a cleaner solution soon. In the meantime:
minVals <- aggregate(data$x, by = list(data$group), min)[, 2]
maxVals <- aggregate(data$x, by = list(data$group), max)[, 2]
ranges <- apply(cbind(minVals, maxVals), 1, function(x) x[1]:x[2])
desired_data <- data.frame(
  x = unlist(ranges),
  group = rep(unique(data$group), lengths(ranges)))
x group
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 1 2
7 2 2
8 3 2
9 4 2
10 5 2
11 6 2
12 1 3
13 2 3
14 3 3
15 4 3
16 5 3
17 6 3
18 7 3
19 8 3
Here's a base R solution.
x = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8)
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
sl = split(x,group)
expanded = lapply(names(sl),function(x){
r = range(sl[[x]])
return(data.frame(x = seq(r[1],r[2],1),group = x))
})
do.call(rbind,expanded)
split() x by group, which results in a named list with one element per group;
using lapply() over the names, we expand each group's values to the full integer range;
finally, do.call() rbinds the results together.
I have two data frames in R
df1 = data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4), ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 = data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4), ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
> df1
Cust ItemId
1 1 1
2 1 2
3 1 3
4 1 4
5 2 2
6 2 3
7 2 2
8 2 5
9 3 1
10 3 2
11 3 5
12 4 6
13 4 2
> df2
ItemId1 ItemId2
1 1 3
2 3 1
3 2 2
4 3 3
5 2 4
6 1 1
7 2 6
8 3 4
9 4 2
10 6 4
11 5 3
12 3 1
13 2 3
14 4 5
All I need is the following output, computed in a way less costly than joins/merge (because in reality I am dealing with billions of records):
> output
ItemId1 ItemId2 Cust
1 1 3 1
2 3 1 1
3 2 2 1, 2, 3, 4
4 3 3 1, 2
5 2 4 1
6 1 1 1, 3
7 2 6 4
8 3 4 1
9 4 2 1
10 6 4 NA
11 5 3 2
12 3 1 1
13 2 3 1, 2
14 4 5 NA
What happens is: if the ItemId1, ItemId2 combination from df2 is present among a customer's ItemId values in df1, we need to return the Cust values (even if there are multiple). If the combination is not present, we need to return NA.
i.e., take the first row as an example: ItemId1 = 1, ItemId2 = 3. Only customer 1 has ItemId = c(1, 3) in df1. Similarly for the next rows.
We can do this using joins/merge, but those are costly operations and they are resulting in a memory error.
This may take more time but won't take much of your memory.
You can convert the for loops to apply functions if desired:
library(plyr)
df1 = data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4), ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 = data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4), ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
temp2 = ddply(df1[,c("Cust","ItemId")], .(Cust), summarize, ItemId = toString(unique(ItemId)))
temp3 = ddply(df1[,c("ItemId","Cust")], .(ItemId), summarize, Cust = toString(unique(Cust)))
dfout = cbind(df2[0,],data.frame(Cust = df1[0,1]))
for(i in 1:nrow(df2)){
  a = df2[i, 1]
  b = df2[i, 2]
  if(a == b){
    dfout = rbind(dfout, data.frame(ItemId1 = a, ItemId2 = a, Cust = temp3$Cust[temp3$ItemId == a]))
  }else{
    cusli = c()
    for(j in 1:nrow(temp2)){
      if(length(grep(a, temp2$ItemId[j])) > 0 & length(grep(b, temp2$ItemId[j])) > 0){
        cusli = c(cusli, temp2$Cust[j])
      }
    }
    dfout = rbind(dfout, data.frame(ItemId1 = a, ItemId2 = b, Cust = paste(cusli, collapse = ", ")))
  }
}
dfout$Cust[dfout$Cust == ""] <- NA
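A hypothetical alternative that avoids both the join and the nested loop: precompute each customer's item set once with split(), then test each df2 pair for membership. This is a sketch under the assumption that exact set membership (rather than the substring matching grep performs) is what is wanted:

```r
df1 <- data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
                  ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 <- data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4),
                  ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))

# One item set per customer, computed once
items_by_cust <- split(df1$ItemId, df1$Cust)

# For each df2 row, list the customers whose set contains both items
df2$Cust <- vapply(seq_len(nrow(df2)), function(i) {
  pair <- c(df2$ItemId1[i], df2$ItemId2[i])
  hits <- names(items_by_cust)[
    vapply(items_by_cust, function(s) all(pair %in% s), logical(1))
  ]
  if (length(hits) == 0) NA_character_ else toString(hits)
}, character(1))
```

Because the customer sets are built only once and everything else is a membership test, memory use stays proportional to df1 plus df2 rather than to any joined intermediate.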