I simply want to take a dataframe with two columns, one with a grouping variable and the second with values, and transform it so that the grouping variable becomes columns with the appropriate values. A very simple question, but after searching for about an hour, I cannot find a good answer. Here is a toy example:
var <- c("Var1", "Var1", "Var2", "Var2")
value <- c(1, 2, 3, 4)
df <- data.frame(var, value)
df.one <- df[df$var == "Var1", ]
df.two <- df[df$var == "Var2", ]
desired.df <- data.frame(df.one[2], df.two[2])
colnames(desired.df) <- c("Var1", "Var2")
desired.df
With more variables and values, this bit of code could become extremely clunky. Can anyone suggest a better method? Any advice would be greatly appreciated!
Data:
df <- structure(list(var = structure(c(1L, 1L, 2L, 2L),
.Label = c("Var1", "Var2"), class = "factor"),
value = c(1, 2, 3, 4)), .Names = c("var", "value"),
class = "data.frame", row.names = c(NA, -4L))
It looks like it is useful to introduce a new variable that identifies the observation within var (I call this case below); you can remove it after reshaping it if you like.
With reshape2/plyr:
library("plyr")
library("reshape2")
## add 'case' identifier
df <- ddply(df, "var", mutate, case = seq_along(var))
## dcast() to reshape; then drop identifier
dcast(df, case ~ var, value.var = "value")[, -1]
With tidyr (same strategy):
library("tidyr")
library("dplyr")
df %>% group_by(var) %>%
mutate(case=seq(n())) %>%
spread(var,value) %>%
select(-case)
This could probably be done with reshape() in base R as well, but I have never been able to figure it out ...
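For what it is worth, here is a rough sketch of how the same strategy might look with reshape() in base R. It assumes the toy data from the question; the case column is built with ave(), and the intermediate name wide is only illustrative.

```r
# Toy data from the question
df <- data.frame(var = c("Var1", "Var1", "Var2", "Var2"),
                 value = c(1, 2, 3, 4))

# Add a within-group observation counter (the 'case' identifier)
df$case <- ave(seq_along(df$var), df$var, FUN = seq_along)

# Long to wide: one row per case, one column per level of var
wide <- reshape(df, idvar = "case", timevar = "var", direction = "wide")

# reshape() names the new columns value.Var1, value.Var2; strip the prefix
names(wide) <- sub("^value\\.", "", names(wide))

# Drop the identifier again
wide <- wide[, setdiff(names(wide), "case")]
wide
```

The sub() call is only there because reshape() prefixes the new columns with the name of the value column.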
Base R solution:
data.frame(split(df$value,df$var))
# Var1 Var2
#1 1 3
#2 2 4
This solution assumes that all 'VarN' subsets have equal length.
A more general solution is:
z <- split(df$value,df$var)
max.length <- max(sapply(z,length))
data.frame(lapply(z,`length<-`,max.length))
which appends NA to the shorter vectors so that all of them have the same length.
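To see the padding in action, here is a small sketch with groups of unequal size. The data frame df_uneven is made up for illustration and is not part of the question.

```r
# Hypothetical data: Var1 has three values, Var2 only one
df_uneven <- data.frame(var = c("Var1", "Var1", "Var1", "Var2"),
                        value = c(1, 2, 3, 4))

z <- split(df_uneven$value, df_uneven$var)
max.length <- max(lengths(z))

# `length<-` extends each vector to max.length, filling with NA
padded <- data.frame(lapply(z, `length<-`, max.length))
padded
```

Here Var2 ends up as 4, NA, NA, so both columns have three rows.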
The overarching problem:
I have my data tibble,
dat <- tibble(var1 = c("rnw wnd", "rnw wat"),
var2 = c("elc", NA))
I have another dataset holding the rules for pattern matching: a grp is assigned when var1 matches var1_patterns and/or var2 matches var2_patterns, and where combine_rule is "&" both patterns must match.
patterns <- tibble(var1_patterns = c("rnw", "wnd", NA),
var2_patterns = c("elc", NA, "elc"),
combine_rule = c("&", NA, NA),
grp = c("elc_rnw", "wnd", "elc"))
I would like to append a list column to dat that contains all the grps whose rules the var1, var2 combination satisfies.
So the result would be:
dat <- tibble(var1 = c("rnw wnd", "rnw wat"),
var2 = c("elc", NA),
grp = c(list(c("elc_rnw", "wnd", "elc")),
list(NA))
)
Simpler problem
That is the full problem, which I realise is quite a lot. In the first instance, it would be helpful to get input on how to map str_match(var1, var1_patterns) to create a list column, ignoring the logical relation between var1 and var2. So the result would be:
dat <- tibble(var1 = c("rnw wnd", "rnw wat"),
var2 = c("elc", NA),
grp = c(list( c("elc_rnw", "wnd")),
list("elc_rnw"))
)
I thought of something like mapping str_match,
dat %>%
mutate(grp = map(var1, ~str_match(.x, patterns$var1_patterns)))
to create a new list column. But I do not know how to map over rows of patterns to create a list column. There is a loop option, but I am trying my best to avoid that!
I should also add that patterns and dat will be arguments to a function, so I (think that I) cannot use case_when for pattern matching.
Any suggestions for either the simple problem, or overarching problem would be much appreciated.
(Apologies if this is a duplicate, but I have not found questions, perhaps because I have not phrased the question appropriately)
I think that I have found a solution to the overarching problem for the specific case of the logical "&" in patterns.
If someone has suggestions for a more elegant way of going about this, or any other suggestions, they would be most welcome.
# Create dat and patterns
dat <- tibble::tibble(var1 = c("rnw wnd", "rnw wat"),
var2 = c("elc", NA))
patterns <- tibble::tibble(var1_patterns = c("rnw", "wnd", NA),
var2_patterns = c("elc", NA, "elc"),
combine_rule = c("&", NA, NA),
grp = c("elc_rnw", "wnd", "elc"))
# function to return logical matches if pattern detected
check_for_matches <- function(var, patterns){
out <- stringr::str_detect(var, patterns)
# if var is missing, we want to return FALSE for all matches
if(is.na(var)){
out <- rep(FALSE, length(out))
}
out
}
library(magrittr)  # provides %>%, since everything else is called via ::
dat %>%
#create logicals for detection of var1 and var2 separately
dplyr::mutate(var1_check = purrr::map(var1,
~check_for_matches(.x,
patterns$var1_patterns))) %>%
dplyr::mutate(var2_check = purrr::map(var2,
~check_for_matches(.x,
patterns$var2_patterns))) %>%
#append the group column
dplyr::mutate(grp = list(patterns$grp)) %>%
# unnest because we are working across columns
tidyr::unnest(c(var1_check, var2_check, grp)) %>%
#create logical for joining var1_check and var2_check and accounting for NAs
dplyr::mutate(joined = dplyr::if_else(!is.na(var1_check & var2_check),
var1_check & var2_check,
dplyr::if_else(is.na(var1_check), var2_check,
var1_check)),
# if joined T, then grp
grp = dplyr::if_else(joined, grp, NA_character_)) %>%
dplyr::select(var1, var2, grp) %>%
tidyr::nest(data = c(grp))
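For the simpler problem from the question (matching on var1 only and ignoring var2), something along these lines might be enough. This is only a sketch in base R: match_groups is a made-up helper, and it uses fixed-string matching via grepl() rather than the regular expressions that str_detect would apply.

```r
var1 <- c("rnw wnd", "rnw wat")
var1_patterns <- c("rnw", "wnd", NA)
grp <- c("elc_rnw", "wnd", "elc")

# Hypothetical helper: return the groups whose pattern occurs in x.
# A missing pattern never counts as a match.
match_groups <- function(x, patterns, groups) {
  hit <- vapply(patterns,
                function(p) !is.na(p) && grepl(p, x, fixed = TRUE),
                logical(1), USE.NAMES = FALSE)
  groups[hit]
}

# One list element per value of var1; usable directly as a list column
grp_list <- lapply(var1, match_groups, patterns = var1_patterns, groups = grp)
grp_list
```

The result can be assigned with dat$grp <- grp_list, which gives c("elc_rnw", "wnd") for the first row and "elc_rnw" for the second.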
I have a dataframe (df) that contains Start times and End times for observations of different IDs:
df <- structure(list(ID = 1:4, Start = c("2021-05-12 13:22:00", "2021-05-12 13:25:00", "2021-05-12 13:30:00", "2021-05-12 13:42:00"),
End = c("2021-05-13 8:15:00", "2021-05-13 8:17:00", "2021-05-13 8:19:00", "2021-05-13 8:12:00")),
class = "data.frame", row.names = c(NA,
-4L))
I want to create a new dataframe that shows the latest Start time and the earliest End time for each possible pairwise comparison between the levels of ID.
I was able to accomplish this by making a duplicate column of ID called ID2, using tidyr::expand to expand them, and saving that in an object called Pairs:
library(dplyr)
library(tidyr)  # expand() comes from tidyr
df$ID2 <- df$ID
Pairs <-
df%>%
expand(ID, ID2)
Making two new objects a and b that store the Start and End times for each comparison separately, and then combining them into df2:
a <- left_join(df, Pairs, by = 'ID')%>%
rename(StartID1 = Start, EndID1 = End, ID2 = ID2.y)%>%
select(-ID2.x)
b <- left_join(Pairs, df, by = "ID2")%>%
rename(StartID2 = Start, EndID2 = End)%>%
select(ID2, StartID2, EndID2)
df2 <- cbind(a,b)
df2 <- df2[,-4]
and finally using dplyr::if_else to find the LatestStart time and the EarliestEnd time for each of the comparisons:
df2 <-
df2%>%
mutate(LatestStart = if_else(StartID1 > StartID2, StartID1, StartID2),
EarliestEnd = if_else(EndID1 > EndID2, EndID2, EndID1))
This seems like such a simple task to perform. Is there a more concise way to achieve this from df without creating all of these extra objects?
For such computations outer usually comes in handy:
df %>%
mutate(across(c("Start", "End"), lubridate::ymd_hms)) %>%
{
data.frame(
ID1 = rep(.$ID, each = nrow(.)),
ID2 = rep(.$ID, nrow(.)),
LatestStart = outer(.$Start, .$Start, pmax),
EarliestEnd = outer(.$End, .$End, pmin)
)
}
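If you would rather stay in base R, a cross join via merge(..., by = NULL) followed by pmax()/pmin() is another option. The sketch below assumes the timestamps parse with as.POSIXct() in UTC; the hours are zero-padded here to keep parsing unambiguous, which differs slightly from the data in the question.

```r
df <- data.frame(
  ID = 1:4,
  Start = c("2021-05-12 13:22:00", "2021-05-12 13:25:00",
            "2021-05-12 13:30:00", "2021-05-12 13:42:00"),
  End = c("2021-05-13 08:15:00", "2021-05-13 08:17:00",
          "2021-05-13 08:19:00", "2021-05-13 08:12:00")
)
df$Start <- as.POSIXct(df$Start, tz = "UTC")
df$End <- as.POSIXct(df$End, tz = "UTC")

# by = NULL gives the Cartesian product; the suffixes tell the two copies apart
pairs <- merge(df, df, by = NULL, suffixes = c("1", "2"))

# Element-wise latest start and earliest end for every pair
pairs$LatestStart <- pmax(pairs$Start1, pairs$Start2)
pairs$EarliestEnd <- pmin(pairs$End1, pairs$End2)

result <- pairs[, c("ID1", "ID2", "LatestStart", "EarliestEnd")]
head(result)
```

This keeps all 16 ID combinations, including each ID paired with itself; filter on ID1 < ID2 if only distinct pairs are wanted.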
I have data that looks like this:
ID FACTOR_VAR INT_VAR
1 CAT 1
1 DOG 0
I want to aggregate by ID such that the resulting dataframe contains the entire row that satisfies my aggregate condition. So if I aggregate by the max of INT_VAR, I want to return the whole first row:
ID FACTOR_VAR INT_VAR
1 CAT 1
The following will not work because FACTOR_VAR is a factor:
new_data <- aggregate(data[,c("ID", "FACTOR_VAR", "INT_VAR")], by=list(data$ID), FUN=max)
How can I do this? I know dplyr has a group by function, but unfortunately I am working on a computer for which downloading packages takes a long time. So I'm looking for a way to do this with just vanilla R.
If you want to keep all the columns, use ave instead:
subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
You can use aggregate for this. If you want to retain all the columns, merge can be used with it.
merge(aggregate(INT_VAR ~ ID, data = df, max), df, all.x = TRUE)
# ID INT_VAR FACTOR_VAR
#1 1 1 CAT
data
df <- structure(list(ID = c(1L, 1L), FACTOR_VAR = structure(1:2, .Label = c("CAT", "DOG"), class = "factor"), INT_VAR = 1:0), class = "data.frame", row.names = c(NA,-2L))
We can do this in dplyr
library(dplyr)
df %>%
group_by(ID) %>%
filter(INT_VAR == max(INT_VAR))
Or using data.table
library(data.table)
setDT(df)[, .SD[INT_VAR == max(INT_VAR)], by = ID]
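Another vanilla R possibility, sketched here on a slightly larger made-up data frame, is to sort by ID and descending INT_VAR and keep the first row per ID. Unlike the filter-style solutions above, this returns exactly one row per ID even when the maximum is tied.

```r
# Hypothetical data with two IDs
df <- data.frame(ID = c(1L, 1L, 2L, 2L),
                 FACTOR_VAR = factor(c("CAT", "DOG", "CAT", "DOG")),
                 INT_VAR = c(1L, 0L, 3L, 5L))

# Sort so that the largest INT_VAR comes first within each ID
ordered <- df[order(df$ID, -df$INT_VAR), ]

# Keep only the first row for each ID
top <- ordered[!duplicated(ordered$ID), ]
top
```

Whole rows are kept, so the factor column comes along without any special handling.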
I have the following dataset:
dataset=structure(list(var1 = c(28.5627505742013, 22.8311421908438, 95.2216156944633,
43.9405107684433, 97.11211245507, 48.4108281508088, 77.1804554760456,
27.1229329891503, 69.5863061584532, 87.2112890332937), var2 = c(32.9009465128183,
54.1136392951012, 69.3181485682726, 70.2100433968008, 44.0986660309136,
62.8759404085577, 79.4413498230278, 97.4315509572625, 62.2505457513034,
76.0133410431445), var3 = c(89.6971945464611, 67.174579706043,
37.0924087055027, 87.7977314218879, 29.3221596442163, 37.5143952667713,
62.6237869635224, 71.3644423149526, 95.3462834469974, 27.4587387405336
), var4 = c(41.5336912125349, 98.2095112837851, 80.7970978319645,
91.1278881691396, 66.4086666144431, 69.2618868127465, 67.7560870349407,
71.4932355284691, 21.345994155854, 31.1811877787113), var5 = c(33.9312525652349,
88.1815139763057, 98.4453701227903, 25.0217059068382, 41.1195872165263,
37.0983888953924, 66.0217586159706, 23.8814191706479, 40.9594196081161,
79.7632974945009), var6 = c(39.813664201647, 80.6405956856906,
30.0273275375366, 34.6203793399036, 96.5195455029607, 44.5830867439508,
78.7370151281357, 42.010761089623, 23.0079878121614, 58.0372223630548
), kmeans = structure(c(2L, 1L, 3L, 1L, 3L, 1L, 1L, 1L, 2L, 3L
), .Label = c("1", "2", "3"), class = "factor")), .Names = c("var1",
"var2", "var3", "var4", "var5", "var6", "kmeans"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
And the following function:
myfun<-function(x){
c(sum(x),mean(x),sd(x))
}
With dplyr::summarise only, the result is ok:
library(tidyverse)
my1<-dataset%>%
summarise_if(.,is.numeric,.funs=funs(sum,mean,sd))
But with myfun it doesn't work:
my2<-dataset%>%
summarise_if(.,is.numeric,.funs=funs(myfun))
Error in summarise_impl(.data, dots) :
Column var1 must be length 1 (a summary value), not 3
What's the problem?
You can try this approach. Your original code does not work because summarise cannot fit the three values returned by your custom function into a single cell. To circumvent the problem, I used enframe with list in the custom function:
library(tidyverse)
myfun<-function(x){
return(list(enframe(c('sum' = sum(x),'mean' = mean(x),'sd' = sd(x)))))
}
For example with mtcars data:
my2<-mtcars%>%
summarise_at(c('mpg','drat'), function(x) myfun(x)) %>%
unnest() %>%
select(-name1) %>%
set_names(nm = c('name', 'mpg', 'drat'))
it will yield:
name mpg drat
1 sum 642.900000 115.0900000
2 mean 20.090625 3.5965625
3 sd 6.026948 0.5346787
Alternatively, you can try solving it with purrr.
For example:
f <- function(x,...){
list('mean' = mean(x, ...),'sum' = sum(x, ...))
}
mtcars %>%
select(mpg, drat) %>%
map_dfr(~ f(.x, na.rm=T), .id ="Name") %>%
data.frame()
When you are applying this function
dataset%>% summarise_if(is.numeric,.funs=funs(sum,mean,sd))
you are applying three different functions (sum, mean and sd), each of which is applied to every numeric column individually. So here we have three different functions, each returning one value per column.
Regarding your function, I think what you were trying to do was
myfun<-function(x){
c(sum(x),mean(x),sd(x))
}
Now, when this function is applied to one column it returns three values, so here one function is returning three values instead.
myfun(dataset$var1)
#[1] 597.17994 59.71799 29.03549
As #NelsonGon mentioned in the comments, you are trying to store three values in a single column. You could return them as a list, as #Pkumar showed, or some variation of do would also help you achieve that. If you split the function into three separate functions, it works the same way as in your earlier example.
myfun1 <- function(x) sum(x)
myfun2 <- function(x) mean(x)
myfun3 <- function(x) sd(x)
dataset %>% summarise_if(is.numeric,.funs=funs(myfun1,myfun2,myfun3))
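Outside of dplyr, a function that returns several values per column is not a problem for sapply(), which simply stacks the results into a matrix. A minimal sketch with the built-in mtcars data (the names given to the vector elements are my own addition):

```r
# Same idea as myfun, but with named elements so the rows are labelled
myfun <- function(x) c(sum = sum(x), mean = mean(x), sd = sd(x))

# Pick the numeric columns, then apply myfun to each of them
num_cols <- vapply(mtcars, is.numeric, logical(1))
stats <- sapply(mtcars[num_cols], myfun)

# One column per variable, one row per statistic
stats[, c("mpg", "drat")]
```

This sidesteps the length-1 restriction because the result is a matrix rather than a one-row summary.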
It's not the most elegant way, but if your external function is just a collection of other functions, maybe you can simply pass them as a list:
myfun_ls <- list(sum,mean,sd)
my2<-dataset%>%
summarise_if(.,is.numeric,.funs=myfun_ls)
I have a dataframe with two odd variables. For one variable, each cell stores a list whose content is simply a vector of two numbers. For the other variable, each cell stores a three-dimensional array (even though only two dimensions are necessary) of 8 numbers.
I want to simplify the dataset by breaking out the odd variable into separate variables. I figured out how to break all the data out using a for loop but this is very slow. I know apply is supposed to be generally quicker, but I can't figure out how I would translate this to apply. Is it possible, or is there a better way to do this?
for (i in 1:nrow(df)){
if (length(df$coordinates.coordinates[[i]]) > 0){
df[i,"coordinates.lon"]<- df$coordinates.coordinates[[i]][1]
df[i,"coordinates.lat"]<- df$coordinates.coordinates[[i]][2]
}
if (length(df$place.bounding_box.coordinates[[i]]) > 0){
df[i,"place.bounding_box.a.lon"] <-df$place.bounding_box.coordinates[[i]][1,1,1]
df[i,"place.bounding_box.b.lon"] <-df$place.bounding_box.coordinates[[i]][1,2,1]
df[i,"place.bounding_box.c.lon"] <-df$place.bounding_box.coordinates[[i]][1,3,1]
df[i,"place.bounding_box.d.lon"] <-df$place.bounding_box.coordinates[[i]][1,4,1]
df[i,"place.bounding_box.a.lat"] <-df$place.bounding_box.coordinates[[i]][1,1,2]
df[i,"place.bounding_box.b.lat"] <-df$place.bounding_box.coordinates[[i]][1,2,2]
df[i,"place.bounding_box.c.lat"] <-df$place.bounding_box.coordinates[[i]][1,3,2]
df[i,"place.bounding_box.d.lat"] <-df$place.bounding_box.coordinates[[i]][1,4,2]
}
}
EDIT
Here is an example dataframe with one case (via dput)
structure(list(coordinates.coordinates = list(c(112.088477, -7.227974
)), place.bounding_box.coordinates = list(structure(c(112.044456,
112.044456, 112.143242, 112.143242, -7.263067, -7.134563, -7.134563,
-7.263067), .Dim = c(1L, 4L, 2L)))), .Names = c("coordinates.coordinates",
"place.bounding_box.coordinates"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -1L))
In case it helps, this is the data format that you get when you read Twitter stream data using jsonlite's stream_in function (with flatten=TRUE).
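One loop-free possibility in base R, offered only as a sketch: flatten every cell to a plain numeric vector, pad empty cells with NA, and bind the rows together in one step. The helpers flatten_or_na and unpack are made-up names, and the two list columns are represented here as plain lists (coords, boxes) with an empty second case added for illustration.

```r
# First case from the dput example, plus an empty case
coords <- list(c(112.088477, -7.227974), NULL)
boxes <- list(
  array(c(112.044456, 112.044456, 112.143242, 112.143242,
          -7.263067, -7.134563, -7.134563, -7.263067), dim = c(1, 4, 2)),
  NULL
)

# Hypothetical helper: a cell becomes a numeric vector, or NAs if it is empty
flatten_or_na <- function(x, n) {
  if (length(x) == 0) rep(NA_real_, n) else as.numeric(x)
}

# Hypothetical helper: one row per cell, one named column per number
unpack <- function(cells, col_names) {
  m <- do.call(rbind, lapply(cells, flatten_or_na, n = length(col_names)))
  colnames(m) <- col_names
  as.data.frame(m)
}

# as.numeric() on a 1 x 4 x 2 array yields the four lon values, then the four lat
box_names <- paste0("place.bounding_box.",
                    rep(c("a", "b", "c", "d"), times = 2),
                    rep(c(".lon", ".lat"), each = 4))

flat <- cbind(unpack(coords, c("coordinates.lon", "coordinates.lat")),
              unpack(boxes, box_names))
flat
```

With the real data you would pass df$coordinates.coordinates and df$place.bounding_box.coordinates instead of coords and boxes, then cbind() the result onto df.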
library(dplyr)
df = data_frame(
coordinates.coordinates =
list(c(0, 1), c(2, 3)),
place.bounding_box.coordinates =
list(array(0, dim=c(1, 4, 2)),
array(1, dim=c(1, 4, 2))))
df %>%
rowwise %>%
do(with(., data_frame(
longitude = coordinates.coordinates[1],
latitude = coordinates.coordinates[2]) %>% bind_cols(
place.bounding_box.coordinates %>%
as.data.frame %>%
setNames(c(
"place.bounding_box.a.lon",
"place.bounding_box.b.lon",
"place.bounding_box.c.lon",
"place.bounding_box.d.lon",
"place.bounding_box.a.lat",
"place.bounding_box.b.lat",
"place.bounding_box.c.lat",
"place.bounding_box.d.lat")))))