Manipulating all split data sets


I'm drawing a blank. I have 51 sets of split data from a data frame, and I want to take the mean of the height of each set.
print(dataset)
$`1`
ID Species Plant Height
1 A 1 42.7
2 A 1 32.5
$`2`
ID Species Plant Height
3 A 2 43.5
4 A 2 54.3
5 A 2 45.7
...
...
...
$`51`
ID Species Plant Height
134 A 51 52.5
135 A 51 61.2
I know how to run each individually, but with 51 split sections, it would take me ages.
I thought that
mean(dataset[,4])
might work, but it says that I have the wrong number of dimensions. I get now why that is incorrect, but I am no closer to figuring out how to average all of the heights.

The dataset is a list. We could use lapply/sapply/vapply etc to loop through the list elements and get the mean of the 'Height' column. Using vapply, we can specify the class and length of the output (numeric(1)). This will be useful for debugging.
vapply(dataset, function(x) mean(x[,4], na.rm=TRUE), numeric(1))
# 1 2 51
#37.60000 47.83333 56.85000
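To make the debugging remark concrete, here is a minimal sketch of how vapply rejects a result of the wrong shape. The two-element list is a cut-down stand-in for the data in the question, and the range() call is only an assumed example of a function that returns the wrong length.

```r
# Cut-down stand-in for the split list in the question (two elements only)
dataset <- list(
  `1` = data.frame(ID = 1:2, Height = c(42.7, 32.5)),
  `2` = data.frame(ID = 3:5, Height = c(43.5, 54.3, 45.7))
)

# One numeric value per list element, as promised by numeric(1)
means <- vapply(dataset, function(x) mean(x$Height, na.rm = TRUE), numeric(1))

# range() returns two values, so vapply stops with an error instead of
# silently returning a matrix the way sapply would
bad <- try(vapply(dataset, function(x) range(x$Height), numeric(1)),
           silent = TRUE)
failed <- inherits(bad, "try-error")
```

With sapply the second call would succeed and return a 2 x 2 matrix, which is the kind of surprise the numeric(1) template guards against.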
Or another option (if we have the same column names/number of columns for the data.frames in the list) would be to use rbindlist from data.table with the option idcol=TRUE to generate a single data.table. The '.id' column shows the names of the list elements. We group by '.id' and get the mean of the Height.
library(data.table)
rbindlist(dataset, idcol=TRUE)[, list(Mean=mean(Height, na.rm=TRUE)), by = .id]
# .id Mean
#1: 1 37.60000
#2: 2 47.83333
#3: 51 56.85000
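Or, similar to the option above, bind_rows from library(dplyr) with .id = ".id" returns a single dataset with an '.id' column holding the list names (older tidyr versions offered unnest(dataset, .id) for this, which current versions may no longer accept for a plain list). Grouped by '.id', we summarise to get the mean of 'Height'.
library(dplyr)
bind_rows(dataset, .id = ".id") %>%
  group_by(.id) %>%
  summarise(Mean = mean(Height, na.rm = TRUE))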
# .id Mean
#1 1 37.60000
#2 2 47.83333
#3 51 56.85000
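The syntax for plyr is
# stack the list into one data frame, keeping the list names in '.id'
df1 <- dplyr::bind_rows(dataset, .id = ".id")
plyr::ddply(df1, ".id", plyr::summarise, Mean = mean(Height, na.rm = TRUE))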
# .id Mean
#1 1 37.60000
#2 2 47.83333
#3 51 56.85000
data
dataset <- structure(list(`1` = structure(list(ID = 1:2, Species = c("A",
"A"), Plant = c(1L, 1L), Height = c(42.7, 32.5)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-2L)), `2` = structure(list(ID = 3:5, Species = c("A", "A", "A"
), Plant = c(2L, 2L, 2L), Height = c(43.5, 54.3, 45.7)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-3L)), `51` = structure(list(ID = 134:135, Species = c("A", "A"
), Plant = c(51L, 51L), Height = c(52.5, 61.2)), .Names = c("ID",
"Species", "Plant", "Height"), class = "data.frame", row.names = c(NA,
-2L))), .Names = c("1", "2", "51"))

This also works, though it uses dplyr.
library(dplyr)
seq_along(dataset) %>%
  lapply(function(i)
    dataset[[i]] %>%
      mutate(section = i)) %>%   # tag each piece with its position in the list
  bind_rows() %>%
  group_by(section) %>%
  summarize(mean_height = mean(Height))
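If the list was produced by split() on the Plant column (an assumption; the question does not show how the split was done), the per-group means can also be taken straight from the unsplit data frame. A minimal base R sketch, where plants is a hypothetical name for that original data frame:

```r
# Hypothetical unsplit data frame with the same columns as in the question
plants <- data.frame(
  ID      = c(1:5, 134:135),
  Species = "A",
  Plant   = c(1L, 1L, 2L, 2L, 2L, 51L, 51L),
  Height  = c(42.7, 32.5, 43.5, 54.3, 45.7, 52.5, 61.2)
)

# Mean height per plant without creating the list at all
by_plant <- tapply(plants$Height, plants$Plant, mean, na.rm = TRUE)

# The same numbers computed from the split list, for comparison
dataset   <- split(plants, plants$Plant)
from_list <- sapply(dataset, function(x) mean(x$Height, na.rm = TRUE))
```

Both results are named by the Plant value, so they can be compared element by element.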

Related

dplyr join with three data frame

I have 3 data frames like this:
df1 <- structure(list(Vehicle = c("Car1", "Car2", "Car8"), Year = c(20L,
21L, 20L), type = c("A", "A", "A")), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(Vehicle = c("Car1", "Car2", "Car7"), Year = c(20L,
21L, 90L), type = c("M", "M", "M")), class = "data.frame", row.names = c(NA, -3L))
df3 <- structure(list(Vehicle = c("Car1", "Car2", "Car9"), Year = c(20L,
21L, 92L), type = c("I", "I", "I")), class = "data.frame", row.names = c(NA, -3L))
And I need to make a new table as follows:
Vehicle Year type
Car1 20 A/M/I
Car2 21 A/M/I
Car7 90 M
Car8 20 A
Car9 92 I
For this purpose I tried the following dplyr code, but it does not work with 3 data frames:
dplyr::full_join(df1, df2, df3, by = c('Vehicle', 'Year')) %>%
tidyr::unite(type, type.x, type.y, sep = '/', na.rm = TRUE)

Try this approach. Instead of merging, it looks like you want to combine all the data frames and then aggregate. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- bind_rows(df1,df2,df3) %>%
group_by(Vehicle,Year) %>%
summarise(type=paste0(type,collapse='|'))
Output:
# A tibble: 5 x 3
# Groups: Vehicle [5]
Vehicle Year type
<chr> <int> <chr>
1 Car1 20 A|M|I
2 Car2 21 A|M|I
3 Car7 90 M
4 Car8 20 A
5 Car9 92 I

Generally, to merge more than two data.frames/tibbles you'd use either base R's Reduce or purrr::reduce; for example, using the latter:
list(df1, df2, df3) %>%
purrr::reduce(dplyr::full_join, by = c("Vehicle", "Year")) %>%
tidyr::unite(type, dplyr::starts_with("type"), sep = "/", na.rm = TRUE)
# Vehicle Year type
#1 Car1 20 A/M/I
#2 Car2 21 A/M/I
#3 Car8 20 A
#4 Car7 90 M
#5 Car9 92 I
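For comparison, the base R Reduce route mentioned above could look roughly like this. It is a sketch under the assumption that exactly three data frames are merged; with more, merge() would start producing duplicated type.x/type.y names, so the type columns would need renaming first.

```r
df1 <- data.frame(Vehicle = c("Car1", "Car2", "Car8"), Year = c(20L, 21L, 20L), type = "A")
df2 <- data.frame(Vehicle = c("Car1", "Car2", "Car7"), Year = c(20L, 21L, 90L), type = "M")
df3 <- data.frame(Vehicle = c("Car1", "Car2", "Car9"), Year = c(20L, 21L, 92L), type = "I")

# Full outer join of all three, two at a time
merged <- Reduce(
  function(x, y) merge(x, y, by = c("Vehicle", "Year"), all = TRUE),
  list(df1, df2, df3)
)

# Columns are now type.x, type.y and type; paste the non-missing ones per row
type_cols <- grep("^type", names(merged))
combined  <- apply(merged[type_cols], 1,
                   function(r) paste(r[!is.na(r)], collapse = "/"))

result <- data.frame(merged[c("Vehicle", "Year")], type = combined)
```

Note that merge() sorts the result by the key columns, so the row order differs from the purrr::reduce output above.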

Using base R
aggregate(type ~ Vehicle + Year, rbind(df1, df2, df3) ,
FUN = paste, collapse="|")
-output
# Vehicle Year type
#1 Car1 20 A|M|I
#2 Car8 20 A
#3 Car2 21 A|M|I
#4 Car7 90 M
#5 Car9 92 I

Add two R data frames of different sizes [duplicate]

If two data frames are
symbol wgt
1 A 2
2 C 4
3 D 6
symbol wgt
1 A 20
2 D 10
how can I add them so that missing observations for a "symbol" in either data frame are treated as zero, giving
symbol wgt
1 A 22
2 C 4
3 D 16

You can join the two data frames by symbol, replace NA with 0 and add the two weights.
library(dplyr)
df1 %>%
left_join(df2, by = 'symbol') %>%
mutate(wgt.y = replace(wgt.y, is.na(wgt.y), 0),
wgt = wgt.x + wgt.y) %>%
select(-wgt.x, -wgt.y)
# symbol wgt
#1 A 22
#2 C 4
#3 D 16
data
df1 <- structure(list(symbol = c("A", "C", "D"), wgt = c(2L, 4L, 6L)),
class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(symbol = c("A", "D"), wgt = c(20L, 10L)),
class = "data.frame", row.names = c(NA, -2L))
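The left join above is enough for the sample data because every symbol in df2 also appears in df1. If df2 could contain symbols that df1 lacks, a full join keeps them as well. A minimal sketch, where the extra symbol "E" is a hypothetical addition that is not part of the question's data:

```r
library(dplyr)

df1 <- data.frame(symbol = c("A", "C", "D"), wgt = c(2L, 4L, 6L))
# "E" is a made-up symbol, present only in df2, to show the difference
df2 <- data.frame(symbol = c("A", "D", "E"), wgt = c(20L, 10L, 5L))

total <- full_join(df1, df2, by = "symbol") %>%
  # treat a missing weight on either side as zero before adding
  mutate(wgt = coalesce(wgt.x, 0L) + coalesce(wgt.y, 0L)) %>%
  select(symbol, wgt)
```

coalesce() replaces NA on both sides, so the same code works whichever data frame is missing a symbol.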

Try this one line solution by pipes:
#Data
library(dplyr)
df1 <- structure(list(symbol = c("A", "C", "D"), wgt = c(2L, 4L, 6L)), class = "data.frame", row.names = c("1",
"2", "3"))
df2 <- structure(list(symbol = c("A", "D"), wgt = c(20L, 10L)), class = "data.frame", row.names = c("1",
"2"))
#Code
df1 %>% left_join(df2, by = 'symbol') %>% mutate(wgt = rowSums(.[-1], na.rm = TRUE)) %>% select(c(1, 4))
symbol wgt
1 A 22
2 C 4
3 D 16

With data.table and the data provided in the answers of @RonakShah and @Duck, the solution could be a simple aggregation:
library(data.table)
# Convert data.frame to data.table (very fast since it happens in place)
setDT(df1)
setDT(df2)
# combine both data.frames into one data.frame, group by symbol, apply the sum (NAs are ignored = counted as zero)
rbind(df1,df2)[, sum(wgt, na.rm = TRUE), by = symbol]
# Output
symbol V1
1: A 22
2: C 4
3: D 16
Note: If you want to use base R only (without data.table) you could use aggregate instead:
aggregate(wgt ~ symbol, rbind(df1,df2), sum)

join variable with variable in which many data contains in row in R

I want to perform a join.
df1=structure(list(id = 1:3, group_id = c(10L, 20L, 40L)), class = "data.frame", row.names = c(NA,
-3L))
df2 has another structure: its group_id field contains many groups, for example {10,100,400}.
Here is its dput():
df2=structure(list(id = 1:3, group_id = structure(c(1L, 3L, 2L), .Label = c("{`10`,100,`40`}",
"{3,`40`,600,100}", "{4}"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
df2 has group_id 10 and 40, but they are in braces together with other groups.
How can I get the desired joined output?
id group_id
1 10
1 40
3 40

You can clean group_id in df2 using gsub, bring each id into separate rows and filter.
library(dplyr)
df2 %>%
mutate(group_id = gsub('[{}`]', '', group_id)) %>%
tidyr::separate_rows(group_id) %>%
filter(group_id %in% df1$group_id)
# id group_id
#1 1 10
#2 1 40
#3 3 40

Here's a data.table alternative (after library(data.table); setDT converts df2 in place):
setDT(df2)[, strsplit(gsub('[{}`]', '', group_id), ','), by = id][V1 %in% df1$group_id]
# id V1
#1: 1 10
#2: 1 40
#3: 3 40

Here is an option with base R using regmatches/gregexpr:
subset(setNames(stack(setNames(regmatches(df2$group_id, gregexpr("\\d+", df2$group_id)),
df2$id))[2:1], c('id', 'group_id')), group_id %in% df1$group_id)
# id group_id
#1 1 10
#3 1 40
#6 3 40
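The nested one-liner above can be hard to follow, so here is the same idea unrolled into steps. This is a sketch of the technique rather than the answerer's exact code; df2 is rebuilt with a character column instead of a factor, and ids and long are illustrative names.

```r
df1 <- data.frame(id = 1:3, group_id = c(10L, 20L, 40L))
df2 <- data.frame(
  id = 1:3,
  group_id = c("{`10`,100,`40`}", "{4}", "{3,`40`,600,100}"),
  stringsAsFactors = FALSE
)

# 1. Pull every run of digits out of each string (a list, one entry per row)
ids <- regmatches(df2$group_id, gregexpr("[0-9]+", df2$group_id))

# 2. Repeat each id once per extracted number to get one row per group
long <- data.frame(
  id       = rep(df2$id, lengths(ids)),
  group_id = as.integer(unlist(ids))
)

# 3. Keep only the groups that also occur in df1
result <- long[long$group_id %in% df1$group_id, ]
```

Converting to integer in step 2 makes the comparison with df1$group_id type-safe instead of relying on coercion inside %in%.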

Merging two df One to Many within List - R

To start I will ignore the use of lists and show what I want using two df's.
I have df1
ID v1 Join_ID
1 100 1
2 110 2
3 150 3
And df2
Join_ID Type v2
1 a 80
1 b 90
2 a 70
2 b 60
3 a 50
3 b 40
I want the df.join to be:
ID v1 Join_ID a_v2 b_v2
1 100 1 80 90
2 110 2 70 60
3 150 3 50 40
I have tried:
df.merged <- merge(df1, df2, by="Join_ID")
df.wide <- dcast(melt(df.merged, id.vars=c("ID", "type")), ID~variable+type)
But this repeats all the variables in df1 for each type: v1_a v1_b
On top of this I have two lists
list.1
df1_a
df1_b
df1_c
list.2
df2_a
df2_b
df2_c
And I want the df1_a in list 1 to join with the df2_a in list 2

We can do this by mapping through the list elements and then doing the join
library(tidyverse)
map2(list.1, list.2, ~
.y %>%
mutate(Type = paste0(Type, "_v2")) %>%
spread(Type, v2) %>%
inner_join(.x, by = 'Join_ID'))
data
df1 <- structure(list(ID = 1:3, v1 = c(100L, 110L, 150L), Join_ID = 1:3),
.Names = c("ID",
"v1", "Join_ID"), class = "data.frame", row.names = c(NA, -3L
))
df2 <- structure(list(Join_ID = c(1L, 1L, 2L, 2L, 3L, 3L), Type = c("a",
"b", "a", "b", "a", "b"), v2 = c(80L, 90L, 70L, 60L, 50L, 40L
)), .Names = c("Join_ID", "Type", "v2"), class = "data.frame", row.names = c(NA,
-6L))
list.1 <- list(df1_a = df1, df1_b = df1, df1_c = df1)
list.2 <- list(df2_a = df2, df2_b = df2, df2_c = df2)
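For readers without the tidyverse, roughly the same pairwise reshape-then-join can be sketched in base R with reshape(), merge() and Map(). The helper name join_one is hypothetical, and the lists here hold only two pairs to keep the example short.

```r
df1 <- data.frame(ID = 1:3, v1 = c(100L, 110L, 150L), Join_ID = 1:3)
df2 <- data.frame(
  Join_ID = rep(1:3, each = 2),
  Type    = rep(c("a", "b"), times = 3),
  v2      = c(80L, 90L, 70L, 60L, 50L, 40L)
)

# join_one() is a hypothetical helper: widen the "many" side, then merge
join_one <- function(one, many) {
  wide <- reshape(many, idvar = "Join_ID", timevar = "Type", direction = "wide")
  # reshape() names the new columns v2.a, v2.b; rename them to a_v2, b_v2
  names(wide) <- sub("^v2\\.(.+)$", "\\1_v2", names(wide))
  merge(one, wide, by = "Join_ID")
}

list.1 <- list(df1_a = df1, df1_b = df1)
list.2 <- list(df2_a = df2, df2_b = df2)

# Pair the lists element by element, like map2 above
joined <- Map(join_one, list.1, list.2)
```

Map() pairs elements by position, so both lists must be in the same order; the result takes its names from list.1.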

Some pointers for your request:
1. the reshaping of df2
2. the join with different column names
library(reshape2)
library(dplyr)  # for full_join
df1=data.frame(id=c(1,2,3), v1=c(100,110,150))
df2=data.frame(Join_ID=c(1,1,2,2,3,3),Type=c("a","b","a","b","a","b"),v2=c(80,90,70,60,50,40))
# name the value column explicitly instead of letting dcast guess it
cast_df2=dcast(df2, Join_ID ~ Type, value.var="v2")
# the two tables share no column names, so no suffix argument is needed
mergedData <- full_join(df1, cast_df2, by=c("id"="Join_ID"))

Barplot dplyr summarized values

I have data from a top 3 ranking. I'm trying to create a plot that would have on the x axis the column name (cost/product), and the y value be the frequency (ideally relative frequency but I'm not sure how to get that in dplyr).
I'm trying to create this in plotly from values summarized in dplyr. I have a dplyr data frame that looks something like this:
likelyReasonFreq<- LikelyRenew_Reason %>%
filter(year==3)%>%
filter(status==1)%>%
summarize(costC = count(cost),
productsC = count(products))
> likelyReasonFreq
costC.x costC.freq productsC.x productsC.freq
1 1 10 1 31
2 2 11 2 40
3 3 17 3 30
4 NA 149 NA 86
I'm trying to create a barplot that shows the total (summed) frequency for cost and for products. The frequency for cost would be the number of times it was ranked 1, 2, or 3, so 38; essentially I'm summing rows 1:3. For products it would be 101 (not including NA values).
I'm not sure how to go about this, any ideas??
below is the variable likelyReasonFreq
> dput(head(likelyReasonFreq))
structure(list(costC = structure(list(x = c(1, 2, 3, NA), freq = c(10L,
11L, 17L, 149L)), .Names = c("x", "freq"), row.names = c(NA,
4L), class = "data.frame"), productsC = structure(list(x = c(1,
2, 3, NA), freq = c(31L, 40L, 30L, 86L)), .Names = c("x", "freq"
), row.names = c(NA, 4L), class = "data.frame")), .Names = c("costC",
"productsC"), row.names = c(NA, 4L), class = "data.frame")
I appreciate any advice!

Your data structure is a little awkward to work with; you can run str or glimpse on it to see the problem. However, you can fix it as shown below and then plot it.
> str(df)
'data.frame': 4 obs. of 2 variables:
$ costC :'data.frame': 4 obs. of 2 variables:
..$ x : num 1 2 3 NA
..$ freq: int 10 11 17 149
$ productsC:'data.frame': 4 obs. of 2 variables:
..$ x : num 1 2 3 NA
..$ freq: int 31 40 30 86
Code to follow for plotting:
library(ggplot2)
library(tidyverse)
df <- df %>% map(unnest) %>% bind_rows(.id="Name") %>% na.omit() #fixing the structure of column taken as a set of two separate columns
df %>%
ggplot(aes(x=Name, y= freq)) +
geom_col()
I hope this is what is expected, although I am not entirely sure of it.
Input data given:
df <- structure(list(costC = structure(list(x = c(1, 2, 3, NA), freq = c(10L,
11L, 17L, 149L)), .Names = c("x", "freq"), row.names = c(NA,
4L), class = "data.frame"), productsC = structure(list(x = c(1,
2, 3, NA), freq = c(31L, 40L, 30L, 86L)), .Names = c("x", "freq"
), row.names = c(NA, 4L), class = "data.frame")), .Names = c("costC",
"productsC"), row.names = c(NA, 4L), class = "data.frame")
Added after OP request:
Here, I have not removed the NAs; instead I have replaced them with a new value '4'. To get the relative frequency within each group, I have used cumsum (its value at x == 3 is the sum over ranks 1 to 3) and then divided by that group's total, NAs included.
df <- df %>% map(unnest) %>% bind_rows(.id="Name")
df[is.na(df$x),"x"] <- 4
df %>%
group_by(Name) %>%
mutate(sum_Freq = sum(freq), cum_Freq = cumsum(freq)) %>%
filter(x == 3) %>%
mutate(new_x = cum_Freq*100/sum_Freq) %>%
ggplot(aes(x=Name, y = new_x)) +
geom_col()
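As a cross-check on the relative frequencies being plotted, the same percentages can be computed directly in base R. This sketch assumes the data has already been flattened into one long data frame (written out by hand here) and uses base barplot rather than ggplot2.

```r
# Flattened version of the question's data, typed out by hand
df <- data.frame(
  Name = rep(c("costC", "productsC"), each = 4),
  x    = rep(c(1, 2, 3, NA), times = 2),
  freq = c(10L, 11L, 17L, 149L, 31L, 40L, 30L, 86L)
)

ranked_rows <- !is.na(df$x)

# Frequency of being ranked 1, 2 or 3, per group
ranked <- tapply(df$freq[ranked_rows], df$Name[ranked_rows], sum)
# Total frequency per group, NA rows included
total  <- tapply(df$freq, df$Name, sum)

rel_pct <- 100 * ranked / total

barplot(rel_pct, ylab = "Ranked in top 3 (% of responses)")
```

Here ranked is 38 for cost and 101 for products, matching the sums described in the question, and each group's total is 187.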
