counting number of observations into a dataframe - r

I want to create a new column in a dataframe that states the number of observations for a particular group.
I have a surgical procedure (HRG.Code) and multiple consultants who perform this procedure (Consultant.Code) and the length of stay their patients are in for in days.
Using
sourceData2$meanvalue <- with(sourceData2, ave(LengthOfStayDays., HRG.Code, Consultant.Code, FUN = mean))
I can get a new column (meanvalue) that shows the mean length of stay per consultant per procedure.
This is just what I need. However, I'd also like to know how many occurrences of each procedure each consultant performed, as a new column in the same data frame.
How do I generate this number of observations? There doesn't appear to be a FUN = Observations or FUN = freq capability.

You may try:
tbl <- table(sourceData2[,3:2]) #gives the frequency of each `procedure` i.e. `HRG.Code` done by every `Consultant.Code`
tbl
#                HRG.Code
# Consultant.Code A B C
#               A 1 1 0
#               B 4 2 1
#               C 0 0 1
#               D 1 1 1
#               E 2 0 0
as.data.frame.matrix(tbl) #converts the `table` to `data.frame`
If you want the total number of unique procedures done by each Consultant.Code, in long form (one value per row):
with(sourceData2, as.numeric(ave(HRG.Code, Consultant.Code,
FUN=function(x) length(unique(x)))))
# [1] 3 3 3 2 1 3 3 3 3 1 1 3 3 3 2
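The question itself asks for a per-row count of observations within each procedure/consultant combination. A minimal sketch of that, assuming the same example data as in this answer (rebuilt inline so the block runs on its own) and a hypothetical column name n_obs: ave() accepts any function, so FUN = length applied to a dummy index vector yields the group size.

```r
# Example data, same values as the data used in this answer.
sourceData2 <- data.frame(
  LengthofStayDays = c(2L, 2L, 4L, 3L, 4L, 5L, 2L, 4L, 5L, 2L, 4L, 2L, 4L, 4L, 2L),
  HRG.Code = c("C", "A", "A", "B", "A", "A", "B", "C", "A", "A", "C", "A", "B", "B", "A"),
  Consultant.Code = c("B", "B", "B", "A", "E", "B", "D", "D", "D", "E", "C", "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# ave() applies FUN to each group's slice of the first argument and
# recycles the result over that group's rows. With FUN = length the
# values of the first argument do not matter, only how many there are.
sourceData2$n_obs <- with(
  sourceData2,
  ave(seq_along(HRG.Code), HRG.Code, Consultant.Code, FUN = length)
)

sourceData2$n_obs
# [1] 1 4 4 1 2 4 1 1 1 2 1 4 2 2 1
```

Using an integer index as the first argument keeps the result numeric; passing the character column HRG.Code instead would return the counts as strings.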
Data:
sourceData2 <- structure(list(LengthofStayDays = c(2L, 2L, 4L, 3L, 4L, 5L, 2L,
4L, 5L, 2L, 4L, 2L, 4L, 4L, 2L), HRG.Code = c("C", "A", "A",
"B", "A", "A", "B", "C", "A", "A", "C", "A", "B", "B", "A"),
Consultant.Code = c("B", "B", "B", "A", "E", "B", "D", "D",
"D", "E", "C", "B", "B", "B", "A")), .Names = c("LengthofStayDays",
"HRG.Code", "Consultant.Code"), row.names = c(NA, -15L), class = "data.frame")

Related

How do you (simply) apply a function to multiple sub-sets of differing lengths in R? [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
Closed 2 years ago.
I need to apply a function to several subsets of data of differing lengths within a column and generate a new data frame which includes the outputs and their associated metadata.
How can I do this without recourse to for loops? tapply() seems like a good place to start, but I struggle with the syntax.
For example -- I have something like this:
block plot id species type response
1 1 1 w a 1.5
1 1 2 w a 1
1 1 3 w a 2
1 1 4 w a 1.5
1 2 5 x a 5
1 2 6 x a 6
1 2 7 x a 7
1 3 8 y b 10
1 3 9 y b 11
1 3 10 y b 9
1 4 11 z b 1
1 4 12 z b 3
1 4 13 z b 2
2 5 14 w a 0.5
2 5 15 w a 1
2 5 16 w a 1.5
2 6 17 x a 3
2 6 18 x a 2
2 6 19 x a 4
2 7 20 y b 13
2 7 21 y b 12
2 7 22 y b 14
2 8 23 z b 2
2 8 24 z b 3
2 8 25 z b 4
2 8 26 z b 2
2 8 27 z b 4
And I want to produce something like this:
block plot species type mean.response
1 1 w a 1.5
1 2 x a 6
1 3 y b 10
1 4 z b 2
2 5 w a 1
2 6 x a 3
2 7 y b 13
2 8 z b 3
Try this. You can use group_by() to set the grouping variables and then summarise() to compute the expected variable. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- df %>% group_by(block, plot, species, type) %>% summarise(Mean = mean(response, na.rm = TRUE))
Output:
# A tibble: 8 x 5
# Groups: block, plot, species [8]
block plot species type Mean
<int> <int> <chr> <chr> <dbl>
1 1 1 w a 1.5
2 1 2 x a 6
3 1 3 y b 10
4 1 4 z b 2
5 2 5 w a 1
6 2 6 x a 3
7 2 7 y b 13
8 2 8 z b 3
Or using base R (-3 drops the third column, id, so that it is not used as a grouping variable in the aggregation):
#Base R
newdf <- aggregate(response ~ ., data = df[, -3], mean, na.rm = TRUE)
Output:
block plot species type response
1 1 1 w a 1.5
2 2 5 w a 1.0
3 1 2 x a 6.0
4 2 6 x a 3.0
5 1 3 y b 10.0
6 2 7 y b 13.0
7 1 4 z b 2.0
8 2 8 z b 3.0
Some data used:
#Data
df <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
Use any of these where the input dd is given reproducibly in the Note at the end:
# 1. aggregate.formula - base R
# Can use just response on left hand side if header doesn't matter.
aggregate(cbind(mean.response = response) ~ block + plot + species + type, dd, mean)
# 2. aggregate.default - base R
v <- c("block", "plot", "species", "type")
aggregate(list(mean.response = dd$response), dd[v], mean)
# 3. sqldf
library(sqldf)
sqldf("select block, plot, species, type, avg(response) as [mean.response]
from dd group by 1, 2, 3, 4")
# 4. data.table
library(data.table)
v <- c("block", "plot", "species", "type")
as.data.table(dd)[, .(mean.response = mean(response)), by = v]
# 5. doBy - last column of output will be labelled response.mean
library(doBy)
summaryBy(response ~ block + plot + species + type, dd)
Note
The input in reproducible form:
dd <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
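Since the question mentions struggling with tapply() syntax, here is a minimal sketch of that route. It assumes the same plot and response values as the example data (rebuilt compactly so the block runs on its own); the name plot_means is only illustrative.

```r
# Same plot/response values as the example data, rebuilt compactly.
dd <- data.frame(
  plot = rep(1:8, times = c(4, 3, 3, 3, 3, 3, 3, 5)),
  response = c(1.5, 1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2,
               0.5, 1, 1.5, 3, 2, 4, 13, 12, 14, 2, 3, 4, 2, 4)
)

# tapply(values, groups, FUN) applies FUN to each group's values and
# returns one result per group, named by the group label.
m <- tapply(dd$response, dd$plot, mean)

# Turn the named result back into a data frame.
plot_means <- data.frame(plot = as.integer(names(m)),
                         mean.response = as.vector(m))
plot_means
```

tapply() returns only the summary values, so the other metadata columns (block, species, type) would have to be merged back in afterwards. That extra step is why aggregate() or group_by() tends to be more convenient for this particular task.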

Shift rows left based on column value

I'm working with a large data frame (30000+ observations with 20 variables) so I can't transpose my data frame. For some rows, some columns are shifted to the right of a Date-class column, but columns to the left of the Date-class column aren't shifted. I tried writing an if statement based on the column where the shift occurs, but I can't seem to wrap my head around it.
Here's some example data:
structure(list(Site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"),
Vial = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L), Date = structure(c(15156, 15156, 15156,
15156, 15156, 15156, 15156, 15156, 15156, 15156, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 15156, 15156, 15156, 15156,
15156, 15156, 15156, 15156, 15156, 15156), class = "Date"),
Value_1 = c("a", "a", "a", "a", "a", "a", "a", "a", "a",
"a", "2011-07-01", "2011-07-01", "2011-07-01", "2011-07-01",
"2011-07-01", "2011-07-01", "2011-07-01", "2011-07-01", "2011-07-01",
"2011-07-01", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"a"), Value_2 = c("b", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b"), Value_3 = c("c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c"), Value_4 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d"
)), row.names = c(NA, -30L), class = "data.frame")
Note that the last column contains NA's but also values.
I urge again that the upstream process should be fixed. In the interim, this hack should work well enough for now (x is the data frame from the question).
nadate <- is.na(x$Date) # rows whose Date is missing
newdate <- as.Date(x$Value_1, format = "%Y-%m-%d") # NA wherever Value_1 is not a date
shift <- nadate & !is.na(newdate) # rows to repair; same length as nrow(x)
x$Date[shift] <- newdate[shift]
ind <- seq(which(colnames(x) == "Date") + 1L, ncol(x) - 1L)
x[shift, ind] <- x[shift, ind + 1L] # move the values one column to the left
x[shift, ncol(x)] <- NA
x
# Site Vial Date Value_1 Value_2 Value_3 Value_4
# 1 1 1 2011-07-01 a b c <NA>
# 2 1 2 2011-07-01 a b c <NA>
# 3 1 3 2011-07-01 a b c <NA>
# 4 1 4 2011-07-01 a b c <NA>
# 5 1 5 2011-07-01 a b c <NA>
# 6 1 6 2011-07-01 a b c <NA>
# 7 1 7 2011-07-01 a b c <NA>
# 8 1 8 2011-07-01 a b c <NA>
# 9 1 9 2011-07-01 a b c <NA>
# 10 1 10 2011-07-01 a b c <NA>
# 11 2 1 2011-07-01 a b c <NA>
# 12 2 2 2011-07-01 a b c <NA>
# 13 2 3 2011-07-01 a b c <NA>
# 14 2 4 2011-07-01 a b c <NA>
# 15 2 5 2011-07-01 a b c <NA>
# 16 2 6 2011-07-01 a b c <NA>
# 17 2 7 2011-07-01 a b c <NA>
# 18 2 8 2011-07-01 a b c <NA>
# 19 2 9 2011-07-01 a b c <NA>
# 20 2 10 2011-07-01 a b c <NA>
# 21 3 1 2011-07-01 a b c d
# 22 3 2 2011-07-01 a b c d
# 23 3 3 2011-07-01 a b c d
# 24 3 4 2011-07-01 a b c d
# 25 3 5 2011-07-01 a b c d
# 26 3 6 2011-07-01 a b c d
# 27 3 7 2011-07-01 a b c d
# 28 3 8 2011-07-01 a b c d
# 29 3 9 2011-07-01 a b c d
# 30 3 10 2011-07-01 a b c d
This should be stable enough: running it again on the same data changes nothing further. If $Date is not NA, no shift is attempted. If $Value_1 does not parse as a date, nothing is shifted.
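The claim that a second run changes nothing can be checked on a tiny example. The sketch below wraps the same idea in a hypothetical helper, shift_left_after_date(), and assumes ISO-formatted dates (YYYY-MM-DD) and at least two columns to the right of the date column; the toy data frame is made up for illustration.

```r
# Hypothetical helper: repair rows where the date landed one column
# to the right of `date_col`. Assumes ISO dates and at least two
# columns after the date column.
shift_left_after_date <- function(x, date_col = "Date") {
  d <- which(colnames(x) == date_col)
  parsed <- as.Date(x[[d + 1L]], format = "%Y-%m-%d")  # NA when not an ISO date
  fix <- is.na(x[[d]]) & !is.na(parsed)                # rows that need repair
  if (!any(fix)) return(x)
  x[[d]][fix] <- parsed[fix]
  ind <- seq(d + 1L, ncol(x) - 1L)
  x[fix, ind] <- x[fix, ind + 1L]                      # shift values left
  x[fix, ncol(x)] <- NA                                # last column is now empty
  x
}

# Made-up toy data: row 2 is shifted, row 1 is fine.
toy <- data.frame(
  Site = c("1", "2"),
  Date = as.Date(c("2011-07-01", NA)),
  Value_1 = c("a", "2011-07-01"),
  Value_2 = c("b", "a"),
  Value_3 = c(NA, "b"),
  stringsAsFactors = FALSE
)

once <- shift_left_after_date(toy)
twice <- shift_left_after_date(once)
identical(once, twice)
```

Passing an explicit format to as.Date() matters here: without it, a first value that is not a date raises an error instead of returning NA.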

Create a dataframe with list elements with dplyr in R

This is my dataframe:
df<-list(structure(list(Col1 = structure(1:6, .Label = c("A", "B",
"C", "D", "E", "F"), class = "factor"), Col2 = structure(c(1L,
2L, 3L, 2L, 4L, 5L), .Label = c("B", "C", "D", "F", "G"), class = "factor")), class = "data.frame", row.names = c(NA,
-6L)), structure(list(Col1 = structure(c(1L, 4L, 5L, 6L, 2L,
3L), .Label = c("A", "E", "H", "M", "N", "P"), class = "factor"),
Col2 = structure(c(1L, 2L, 3L, 2L, 4L, 5L), .Label = c("B",
"C", "D", "F", "G"), class = "factor")), class = "data.frame", row.names = c(NA,
-6L)), structure(list(Col1 = structure(c(1L, 4L, 6L, 5L, 2L,
3L), .Label = c("A", "W", "H", "M", "T", "U"), class = "factor"),
Col2 = structure(c(1L, 2L, 3L, 2L, 4L, 5L), .Label = c("B",
"C", "D", "S", "G"), class = "factor")), class = "data.frame", row.names = c(NA,
-6L)))
I want to extract col1 = df[[1]][1] as a dataframe. Then I want to merge column 1 of the second element of the list onto df[[1]][1], which gives a dataframe with 2 columns.
After this I want to merge column 1 of the third element of the list onto that two-column dataframe, which gives a dataframe with 3 columns.
In other words, my dataframe should have 3 columns: the first column of each entry of my list.
Can the dplyr package help me do this?
Any help?
You can use lapply to extract the three columns named "Col1" in one go. Then set the names of the result.
col1 <- as.data.frame(lapply(df, '[[', "Col1"))
names(col1) <- letters[seq_along(col1)]
col1
# a b c
#1 A A A
#2 B M M
#3 C N U
#4 D P T
#5 E E W
#6 F H H
Choose any other column names that you might find better.
A dplyr way could be
library(dplyr)
df %>%
  unlist(recursive = FALSE) %>%
  as.data.frame %>%
  select(., starts_with("Col1"))
# Col1 Col1.1 Col1.2
#1 A A A
#2 B M M
#3 C N U
#4 D P T
#5 E E W
#6 F H H
With map_dfc from purrr:
library(purrr)
map_dfc(df, `[`, 1)
Output:
Col1 Col11 Col12
1 A A A
2 B M M
3 C N U
4 D P T
5 E E W
6 F H H
An alternative use of map_dfc relies on purrr's concise element-extraction syntax, which lets you pick out elements of elements by name or position. The first call in the reprex below, map_dfc(df, 1), is for example equivalent to
map_dfc(df, `[[`, 1)
which differs from the use of [ in that the columns are not named as variations of Col1 but get V names instead. This may be desirable, since names like Col11 and Col12 can be confusing.
df <- list(structure(list(Col1 = structure(1:6, .Label = c("A", "B", "C", "D", "E", "F"), class = "factor"), Col2 = structure(c(1L, 2L, 3L, 2L, 4L, 5L), .Label = c("B", "C", "D", "F", "G"), class = "factor")), class = "data.frame", row.names = c(NA, -6L)), structure(list(Col1 = structure(c(1L, 4L, 5L, 6L, 2L, 3L), .Label = c("A", "E", "H", "M", "N", "P"), class = "factor"), Col2 = structure(c(1L, 2L, 3L, 2L, 4L, 5L), .Label = c("B", "C", "D", "F", "G"), class = "factor")), class = "data.frame", row.names = c(NA, -6L)), structure(list(Col1 = structure(c(1L, 4L, 6L, 5L, 2L, 3L), .Label = c("A", "W", "H", "M", "T", "U"), class = "factor"), Col2 = structure(c(1L, 2L, 3L, 2L, 4L, 5L), .Label = c("B", "C", "D", "S", "G"), class = "factor")), class = "data.frame", row.names = c(NA, -6L)))
library(purrr)
map_dfc(df, 1)
#> # A tibble: 6 x 3
#> V1 V2 V3
#> <fct> <fct> <fct>
#> 1 A A A
#> 2 B M M
#> 3 C N U
#> 4 D P T
#> 5 E E W
#> 6 F H H
map_dfc(df, "Col1")
#> # A tibble: 6 x 3
#> V1 V2 V3
#> <fct> <fct> <fct>
#> 1 A A A
#> 2 B M M
#> 3 C N U
#> 4 D P T
#> 5 E E W
#> 6 F H H
Created on 2018-09-19 by the reprex package (v0.2.0).
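The naming difference between the `[` and `[[` variants comes from what each operator returns on a single data frame. A minimal base R sketch, using a made-up two-row data frame d:

```r
# Made-up data frame for illustration.
d <- data.frame(Col1 = c("A", "B"), Col2 = c("x", "y"),
                stringsAsFactors = FALSE)

one_col_df <- d[1]    # single bracket: a one-column data frame, name kept
bare_vec   <- d[[1]]  # double bracket: the column itself, name dropped

class(one_col_df)  # "data.frame"
names(one_col_df)  # "Col1"
class(bare_vec)    # "character"
```

Binding several one-column data frames that all carry the name Col1 forces the names to be made unique, whereas binding bare vectors leaves the result to be named from scratch.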
res<-1:nrow(df[[1]][1])
for(i in 1:length(df)){
print ( as.vector(df[[i]][1]))
res<-cbind(res,as.data.frame(df[[i]][1]))
}
res$res<-NULL
So, the output is:
Col1 Col1 Col1
1 A A A
2 B M M
3 C N U
4 D P T
5 E E W
6 F H H
Using dplyr
library(dplyr)
df %>%
sapply('[[',1) %>%
as.data.frame
#returns
V1 V2 V3
1 A A A
2 B M M
3 C N U
4 D P T
5 E E W
6 F H H

Preparing data for Gephi

Greetings,
I need to prepare data for network analysis in Gephi. I have data in the following format:
MY Data
And I need data in format (Where the values represent persons that are connected through the organization):
Required format
Thank you very much!
I think this code should do the job. It is not the most elegant way of doing it, but it works :)
# Data
x <-
structure(
list(
Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")
),
.Names = c("Persons", "Organizations"),
class = "data.frame",
row.names = c(NA, -11L)
)
# This will merge n:n
edgelist <- merge(x, x, by = "Organizations")[,2:3]
# We don't want autolinks
edgelist <- subset(edgelist, Persons.x != Persons.y)
# Removing those that are repeated
edgelist <- unique(edgelist)
edgelist
#> Persons.x Persons.y
#> 2 1 3
#> 3 1 2
#> 4 3 1
#> 6 3 2
#> 7 2 1
#> 8 2 3
HIH
Created on 2018-01-03 by the reprex package (v0.1.1.9000).
Starting with x:
structure(list(Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")), .Names = c("Persons", "Organizations"), class = "data.frame", row.names = c(NA,-11L))
Create a new data.frame with different names. Just convert Organizations to a factor and then use the numeric values:
> y=data.frame(Source=x$Persons, Target=as.numeric(as.factor(x$Organizations)))
> y
Source Target
1 1 1
2 1 2
3 1 5
4 2 6
5 2 1
6 2 5
7 2 3
8 3 4
9 3 3
10 3 1
11 3 5
For what it's worth, I'm pretty sure Gephi can handle strings.
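If the network is meant to be undirected (two persons are linked when they share an organization), each pair only needs to appear once. A minimal sketch under that assumption, which also writes the result to a CSV file with Source and Target headers; as far as I know these are the column names Gephi's spreadsheet import looks for, but treat that as an assumption and check it against your Gephi version. The names pairs, edges and out are illustrative.

```r
x <- data.frame(
  Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E"),
  stringsAsFactors = FALSE
)

# Self-join on organization: every pair of persons sharing one.
pairs <- merge(x, x, by = "Organizations")

# Keep each unordered pair once (this also drops self-links).
pairs <- pairs[pairs$Persons.x < pairs$Persons.y, ]

# One row per pair, however many organizations they share.
edges <- unique(data.frame(Source = pairs$Persons.x, Target = pairs$Persons.y))
edges <- edges[order(edges$Source, edges$Target), ]
rownames(edges) <- NULL

# Write to a temporary file; replace with a real path as needed.
out <- tempfile(fileext = ".csv")
write.csv(edges, out, row.names = FALSE)
edges
```

With the example data this leaves three edges (1-2, 1-3, 2-3), half of the six directed rows produced above.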

Turning list into a data.frame

mylist <- list(structure(c(1L, 1L, 2L, 2L, 2L, 2L, NA, NA), .Names = c("A",
"B", "C", "D", "E", "F", "G", "H")), structure(c(1L, 1L, 1L,
1L, 1L, 2L, 1L, NA), .Names = c("A", "B", "C", "D", "E", "F",
"G", "H")))
mylist
[[1]]
A B C D E F G H
1 1 2 2 2 2 NA NA
[[2]]
A B C D E F G H
1 1 1 1 1 2 1 NA
I have a list like above and I want to collapse it into a data.frame so that I can subset each column individually ie df$A, df$B, etc.
> df$A
[1] 1 1
> df$B
[1] 1 1
> df$C
[1] 2 1
And so forth
You could unlist and then split according to the names, something like
temp <- unlist(mylist)
res <- split(unname(temp), names(temp))
# res$A
# [1] 1 1
# res$B
# [1] 1 1
# res$C
# [1] 2 1
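The split() result above is a list rather than a data.frame. If an actual data.frame is wanted, one possible sketch is to row-bind the vectors and convert the resulting matrix. This assumes every list element has the same names in the same order, because rbind() aligns by position rather than by name.

```r
mylist <- list(
  c(A = 1L, B = 1L, C = 2L, D = 2L, E = 2L, F = 2L, G = NA, H = NA),
  c(A = 1L, B = 1L, C = 1L, D = 1L, E = 1L, F = 2L, G = 1L, H = NA)
)

# rbind() stacks the vectors into a 2 x 8 matrix whose column names
# come from the vector names; as.data.frame() makes each column
# accessible with $.
df <- as.data.frame(do.call(rbind, mylist))

df$C
# [1] 2 1
df$G
# [1] NA  1
```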

Resources