I'm working with a large data frame (30,000+ observations of 20 variables), so I can't simply transpose it. For some rows, the values from a Date-class column onward are shifted one column to the right, while the columns to the left of the Date column are unaffected. I tried writing an if statement based on the column where the shift occurs, but I can't seem to wrap my head around it.
Here's some example code:
x <- structure(list(Site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"),
Vial = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L), Date = structure(c(15156, 15156, 15156,
15156, 15156, 15156, 15156, 15156, 15156, 15156, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 15156, 15156, 15156, 15156,
15156, 15156, 15156, 15156, 15156, 15156), class = "Date"),
Value_1 = c("a", "a", "a", "a", "a", "a", "a", "a", "a",
"a", "2011-07-01", "2011-07-01", "2011-07-01", "2011-07-01",
"2011-07-01", "2011-07-01", "2011-07-01", "2011-07-01", "2011-07-01",
"2011-07-01", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"a"), Value_2 = c("b", "b", "b", "b", "b", "b", "b", "b",
"b", "b", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a",
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b"), Value_3 = c("c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "b", "b", "b",
"b", "b", "b", "b", "b", "b", "b", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c"), Value_4 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d"
)), row.names = c(NA, -30L), class = "data.frame")
Note that the last column contains NA's but also values.
I'll urge again that the upstream process producing these shifted rows should be fixed. In the interim, this workaround should work well enough:
nadate <- is.na(x$Date)   # x is the data frame from the question; an NA Date marks a shifted row
newdate <- as.Date(x$Value_1[nadate], format = "%Y-%m-%d")   # parse the displaced date (format assumed from the sample data)
newnotna <- !is.na(newdate)   # which of those actually parsed as dates
shift <- nadate
shift[nadate] <- newnotna     # full-length flag: NA Date *and* Value_1 parseable as a date
x$Date[shift] <- newdate[newnotna]   # restore the date
ind <- seq(which(colnames(x) == "Date") + 1L, ncol(x) - 1L)
x[shift, ind] <- x[shift, ind + 1L]  # pull the remaining columns one position to the left
x[shift, ncol(x)] <- NA              # the last column has nothing to inherit
x
# Site Vial Date Value_1 Value_2 Value_3 Value_4
# 1 1 1 2011-07-01 a b c <NA>
# 2 1 2 2011-07-01 a b c <NA>
# 3 1 3 2011-07-01 a b c <NA>
# 4 1 4 2011-07-01 a b c <NA>
# 5 1 5 2011-07-01 a b c <NA>
# 6 1 6 2011-07-01 a b c <NA>
# 7 1 7 2011-07-01 a b c <NA>
# 8 1 8 2011-07-01 a b c <NA>
# 9 1 9 2011-07-01 a b c <NA>
# 10 1 10 2011-07-01 a b c <NA>
# 11 2 1 2011-07-01 a b c <NA>
# 12 2 2 2011-07-01 a b c <NA>
# 13 2 3 2011-07-01 a b c <NA>
# 14 2 4 2011-07-01 a b c <NA>
# 15 2 5 2011-07-01 a b c <NA>
# 16 2 6 2011-07-01 a b c <NA>
# 17 2 7 2011-07-01 a b c <NA>
# 18 2 8 2011-07-01 a b c <NA>
# 19 2 9 2011-07-01 a b c <NA>
# 20 2 10 2011-07-01 a b c <NA>
# 21 3 1 2011-07-01 a b c d
# 22 3 2 2011-07-01 a b c d
# 23 3 3 2011-07-01 a b c d
# 24 3 4 2011-07-01 a b c d
# 25 3 5 2011-07-01 a b c d
# 26 3 6 2011-07-01 a b c d
# 27 3 7 2011-07-01 a b c d
# 28 3 8 2011-07-01 a b c d
# 29 3 9 2011-07-01 a b c d
# 30 3 10 2011-07-01 a b c d
This should be stable enough: it is effectively idempotent, so running it multiple times on the same data does nothing further. If $Date is not NA, no shift is attempted; if $Value_1 does not parse as a date, nothing is shifted.
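To make that claim easy to check, here is a sketch that wraps the steps above in a hypothetical helper, fix_shift(), and verifies that applying it a second time changes nothing:
# Hypothetical wrapper around the steps above (sketch only)
fix_shift <- function(x) {
  nadate <- is.na(x$Date)
  newdate <- as.Date(x$Value_1[nadate], format = "%Y-%m-%d")
  newnotna <- !is.na(newdate)
  shift <- nadate
  shift[nadate] <- newnotna
  x$Date[shift] <- newdate[newnotna]
  ind <- seq(which(colnames(x) == "Date") + 1L, ncol(x) - 1L)
  x[shift, ind] <- x[shift, ind + 1L]
  x[shift, ncol(x)] <- NA
  x
}
identical(fix_shift(x), fix_shift(fix_shift(x)))  # TRUE: a second pass changes nothing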
Related
I am working with the R programming language.
I have the following dataset ("my_data"):
structure(list(idd = 1:50, group_1 = c("B", "B", "A", "B", "B",
"A", "A", "A", "B", "A", "A", "B", "B", "B", "A", "A", "A", "A",
"B", "B", "A", "B", "A", "B", "A", "B", "B", "A", "B", "B", "B",
"A", "B", "A", "B", "B", "A", "A", "A", "A", "A", "B", "B", "B",
"A", "B", "B", "B", "B", "B"), v1 = c(15.7296737049317, -4.33377704672207,
-0.551850185265, 2.66888122578048, 12.109072642513, 0.0107927293899017,
20.7785032320562, -1.98974382507874, 12.1663703518471, 11.4308702978893,
-0.657500910529805, 5.71376589298221, 3.43820523228653, 19.5939432685761,
25.5605263610222, -0.407964337882465, 19.3057240854025, 9.24554068987809,
-9.6719534905096, 2.44096357354807, 14.6114916050676, 11.4510663104787,
-14.4231132108142, 15.8031868545157, 16.5505199848675, 6.95491162740581,
2.92431767382703, 29.7157201447823, 9.10001319352251, 9.85982748068076,
-1.23456937110154, -3.44130123376206, -5.23155771062088, 5.78031789617826,
23.6092446408098, 27.5379484533487, 25.6836473435279, 22.9675556994775,
7.62403748556388, -2.24150135680706, 6.72187319859928, -14.1245027627225,
6.8620712655661, 26.5987870464572, 11.3095310060752, 20.9588868268958,
14.8934095694391, 2.21089704551347, 27.4355935292935, 9.21612714668934
), group_2 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L)), row.names = c(NA, -50L), class = "data.frame")
head(my_data, 12)
idd group_1 v1 group_2
1 1 B 15.72967370 1
2 2 B -4.33377705 2
3 3 A -0.55185019 3
4 4 B 2.66888123 4
5 5 B 12.10907264 5
6 6 A 0.01079273 6
7 7 A 20.77850323 7
8 8 A -1.98974383 8
9 9 B 12.16637035 9
10 10 A 11.43087030 10
11 11 A -0.65750091 1
12 12 B 5.71376589 2
For this dataset, I want to perform the following steps in "dplyr":
For each grouping of 10 rows, find the sum of "v1" for the rows where group_1 = "A" and the sum for the rows where group_1 = "B"
For each of these groupings, create a new variable ("v2") that is: "A" if sum(group_1 = "A") > sum(group_1 = "B"), "B" if sum(group_1 = "A") < sum(group_1 = "B"), or "0" if sum(group_1 = "A") = sum(group_1 = "B")
I know how to do this manually in R:
#STEP 1: since my_data has 50 rows, break my_data into 5 groups of 10 rows
rows_1 = my_data[1:10,]
rows_2 = my_data[11:20,]
rows_3 = my_data[21:30,]
rows_4 = my_data[31:40,]
rows_5 = my_data[41:50,]
# STEP 2: find out values of "v2"
library(dplyr)
dplyr_row_1 = data.frame(rows_1 %>% group_by(group_1) %>% summarize(sum = sum(v1)))
dplyr_row_1$v2 = ifelse(dplyr_row_1[1,2] > dplyr_row_1[2,2], "A", ifelse(dplyr_row_1[1,2] < dplyr_row_1[2,2], "B", 0))
dplyr_row_2 = data.frame(rows_2 %>% group_by(group_1) %>% summarize(sum = sum(v1)))
dplyr_row_2$v2 = ifelse(dplyr_row_2[1,2] > dplyr_row_2[2,2], "A", ifelse(dplyr_row_2[1,2] < dplyr_row_2[2,2], "B", 0))
dplyr_row_3 = data.frame(rows_3 %>% group_by(group_1) %>% summarize(sum = sum(v1)))
dplyr_row_3$v2 = ifelse(dplyr_row_3[1,2] > dplyr_row_3[2,2], "A", ifelse(dplyr_row_3[1,2] < dplyr_row_3[2,2], "B", 0))
dplyr_row_4 = data.frame(rows_4 %>% group_by(group_1) %>% summarize(sum = sum(v1)))
dplyr_row_4$v2 = ifelse(dplyr_row_4[1,2] > dplyr_row_4[2,2], "A", ifelse(dplyr_row_4[1,2] < dplyr_row_4[2,2], "B", 0))
dplyr_row_5 = data.frame(rows_5 %>% group_by(group_1) %>% summarize(sum = sum(v1)))
dplyr_row_5$v2 = ifelse(dplyr_row_5[1,2] > dplyr_row_5[2,2], "A", ifelse(dplyr_row_5[1,2] < dplyr_row_5[2,2], "B", 0))
# STEP 3: append "v2" to first 5 files:
rows_1$v2 = dplyr_row_1$v2
rows_2$v2 = dplyr_row_2$v2
rows_3$v2 = dplyr_row_3$v2
rows_4$v2 = dplyr_row_4$v2
rows_5$v2 = dplyr_row_5$v2
# STEP 4: create final file:
final_file = rbind(rows_1,rows_2, rows_3, rows_4, rows_5)
As a result, the final file looks something like this:
idd group_1 v1 group_2 v2
1 1 B 15.72967370 1 B
2 2 B -4.33377705 2 B
3 3 A -0.55185019 3 B
4 4 B 2.66888123 4 B
5 5 B 12.10907264 5 B
6 6 A 0.01079273 6 B
7 7 A 20.77850323 7 B
8 8 A -1.98974383 8 B
9 9 B 12.16637035 9 B
10 10 A 11.43087030 10 B
11 11 A -0.65750091 1 A
My Question: Can someone please show me how to perform Steps 1 to Step 4 in a single "dplyr" command?
Thanks!
Here is an alternative method.
library(tidyverse)
df %>%
  mutate(group_index = rep(1:(n() / 10), each = 10)) %>%
  group_by(group_index) %>%
  mutate(
    v2 = case_when(
      sum(v1[group_1 == 'A']) > sum(v1[group_1 == 'B']) ~ 'A',
      sum(v1[group_1 == 'A']) < sum(v1[group_1 == 'B']) ~ 'B',
      TRUE ~ '0'
    )
  )
First I create a group_index to group every 10 rows together.
Then group_by() the relevant columns and calculate the sum per group_1 category.
Next, drop group_1 from the grouping, since we need to compare the sums for "A" and "B" within each block of rows.
If there is only one unique sum, the category sums are equal, so v2 is set to "0". Otherwise v2 is the group_1 category with the largest sum.
Finally, remove the helper columns sum and group_index.
This method also works when group_1 contains more than two categories.
The first 20 rows of the result are shown below as an example.
library(tidyverse)
df %>%
  mutate(group_index = rep(1:(nrow(df) / 10), each = 10)) %>%
  group_by(group_index, group_1) %>%
  mutate(sum = sum(v1)) %>%
  group_by(group_index) %>%
  mutate(v2 = ifelse(length(unique(sum)) == 1, 0, group_1[which.max(sum)])) %>%
  ungroup() %>%
  select(-c(sum, group_index))
# A tibble: 20 x 5
idd group_1 v1 group_2 v2
<int> <chr> <dbl> <int> <chr>
1 1 B 15.7 1 B
2 2 B -4.33 2 B
3 3 A -0.552 3 B
4 4 B 2.67 4 B
5 5 B 12.1 5 B
6 6 A 0.0108 6 B
7 7 A 20.8 7 B
8 8 A -1.99 8 B
9 9 B 12.2 9 B
10 10 A 11.4 10 B
11 11 A -0.658 1 A
12 12 B 5.71 2 A
13 13 B 3.44 3 A
14 14 B 19.6 4 A
15 15 A 25.6 5 A
16 16 A -0.408 6 A
17 17 A 19.3 7 A
18 18 A 9.25 8 A
19 19 B -9.67 9 A
20 20 B 2.44 10 A
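Both snippets above assume the row count is an exact multiple of 10; otherwise the rep() call fails. If that is not guaranteed (this is my own addition, not part of either answer), a group index built from row_number() and integer division is a drop-in replacement (df is the question's my_data):
library(dplyr)
df %>%
  mutate(group_index = (row_number() - 1) %/% 10 + 1) %>%   # 1 for rows 1-10, 2 for rows 11-20, ...
  group_by(group_index) %>%
  mutate(
    v2 = case_when(
      sum(v1[group_1 == "A"]) > sum(v1[group_1 == "B"]) ~ "A",
      sum(v1[group_1 == "A"]) < sum(v1[group_1 == "B"]) ~ "B",
      TRUE ~ "0"
    )
  ) %>%
  ungroup() %>%
  select(-group_index)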
I need to apply a function to several subsets of data of differing lengths within a column and generate a new data frame which includes the outputs and their associated metadata.
How can I do this without recourse to for loops? tapply() seems like a good place to start, but I struggle with the syntax.
For example -- I have something like this:
block plot id species type response
1 1 1 w a 1.5
1 1 2 w a 1
1 1 3 w a 2
1 1 4 w a 1.5
1 2 5 x a 5
1 2 6 x a 6
1 2 7 x a 7
1 3 8 y b 10
1 3 9 y b 11
1 3 10 y b 9
1 4 11 z b 1
1 4 12 z b 3
1 4 13 z b 2
2 5 14 w a 0.5
2 5 15 w a 1
2 5 16 w a 1.5
2 6 17 x a 3
2 6 18 x a 2
2 6 19 x a 4
2 7 20 y b 13
2 7 21 y b 12
2 7 22 y b 14
2 8 23 z b 2
2 8 24 z b 3
2 8 25 z b 4
2 8 26 z b 2
2 8 27 z b 4
And I want to produce something like this:
block plot species type mean.response
1 1 w a 1.5
1 2 x a 6
1 3 y b 10
1 4 z b 2
2 5 w a 1
2 6 x a 3
2 7 y b 13
2 8 z b 3
Try this. You can use group_by() to set the grouping variables and then summarise() to compute the desired variable. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- df %>%
  group_by(block, plot, species, type) %>%
  summarise(Mean = mean(response, na.rm = TRUE))
Output:
# A tibble: 8 x 5
# Groups: block, plot, species [8]
block plot species type Mean
<int> <int> <chr> <chr> <dbl>
1 1 1 w a 1.5
2 1 2 x a 6
3 1 3 y b 10
4 1 4 z b 2
5 2 5 w a 1
6 2 6 x a 3
7 2 7 y b 13
8 2 8 z b 3
Or using base R (-3 is used to omit the id variable from the aggregation):
#Base R
newdf <- aggregate(response ~ ., data = df[, -3], mean, na.rm = TRUE)
Output:
block plot species type response
1 1 1 w a 1.5
2 2 5 w a 1.0
3 1 2 x a 6.0
4 2 6 x a 3.0
5 1 3 y b 10.0
6 2 7 y b 13.0
7 1 4 z b 2.0
8 2 8 z b 3.0
Some data used:
#Data
df <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
Use any of these where the input dd is given reproducibly in the Note at the end:
# 1. aggregate.formula - base R
# Can use just response on left hand side if header doesn't matter.
aggregate(cbind(mean.response = response) ~ block + plot + species + type, dd, mean)
# 2. aggregate.default - base R
v <- c("block", "plot", "species", "type")
aggregate(list(mean.response = dd$response), dd[v], mean)
# 3. sqldf
library(sqldf)
sqldf("select block, plot, species, type, avg(response) as [mean.response]
from dd group by 1, 2, 3, 4")
# 4. data.table
library(data.table)
v <- c("block", "plot", "species", "type")
as.data.table(dd)[, .(mean.response = mean(response)), by = v]
# 5. doBy - last column of output will be labelled response.mean
library(doBy)
summaryBy(response ~ block + plot + species + type, dd)
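Since the question mentions struggling with tapply(), here is one more option as a sketch (my own addition, not part of the original answer). tapply() with several grouping factors returns a multi-way array, so it needs reshaping afterwards:
# 6. tapply - base R (sketch; the grouping columns come back as factors)
tab <- with(dd, tapply(response, list(block, plot, species, type), mean))
out <- na.omit(as.data.frame.table(tab, responseName = "mean.response"))
names(out)[1:4] <- c("block", "plot", "species", "type")
out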
Note
The input in reproducible form:
dd <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
Greetings,
I need to prepare data for network analysis in Gephi. I have data in the following format:
[image: My Data]
And I need it in the following format, where the values represent persons that are connected through the same organization:
[image: Required format]
Thank you very much!
I think this code should do the job. It is not the most elegant way of doing it, but it works :)
# Data
x <-
  structure(
    list(
      Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
      Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")
    ),
    .Names = c("Persons", "Organizations"),
    class = "data.frame",
    row.names = c(NA, -11L)
  )
# This will merge n:n
edgelist <- merge(x, x, by = "Organizations")[,2:3]
# We don't want autolinks
edgelist <- subset(edgelist, Persons.x != Persons.y)
# Removing those that are repeated
edgelist <- unique(edgelist)
edgelist
#> Persons.x Persons.y
#> 2 1 3
#> 3 1 2
#> 4 3 1
#> 6 3 2
#> 7 2 1
#> 8 2 3
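If the graph is undirected (an assumption on my part, not stated in the question), you may also want to collapse the two directions of each pair into a single row:
# Keep one row per unordered pair by sorting each pair first (sketch only)
edgelist_undirected <- unique(data.frame(
  Source = pmin(edgelist$Persons.x, edgelist$Persons.y),
  Target = pmax(edgelist$Persons.x, edgelist$Persons.y)
))
edgelist_undirected   # leaves the three unordered pairs 1-2, 1-3 and 2-3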
HTH
Created on 2018-01-03 by the reprex package (v0.1.1.9000).
Starting with x:
structure(list(Persons = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), Organizations = c("A", "B", "E", "F", "A", "E", "C", "D", "C", "A", "E")), .Names = c("Persons", "Organizations"), class = "data.frame", row.names = c(NA,-11L))
Create a new data.frame with new column names (Source and Target). Just convert Organizations to a factor and then use its underlying numeric codes:
> y=data.frame(Source=x$Persons, Target=as.numeric(as.factor(x$Organizations)))
> y
Source Target
1 1 1
2 1 2
3 1 5
4 2 6
5 2 1
6 2 5
7 2 3
8 3 4
9 3 3
10 3 1
11 3 5
For what it's worth, I'm pretty sure Gephi can handle strings.
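In that case (a sketch, assuming string node identifiers are acceptable), you could skip the factor conversion entirely:
y <- data.frame(Source = x$Persons, Target = x$Organizations)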
mylist <- list(structure(c(1L, 1L, 2L, 2L, 2L, 2L, NA, NA), .Names = c("A",
"B", "C", "D", "E", "F", "G", "H")), structure(c(1L, 1L, 1L,
1L, 1L, 2L, 1L, NA), .Names = c("A", "B", "C", "D", "E", "F",
"G", "H")))
mylist
[[1]]
A B C D E F G H
1 1 2 2 2 2 NA NA
[[2]]
A B C D E F G H
1 1 1 1 1 2 1 NA
I have a list like the above and I want to collapse it into a data.frame so that I can subset each column individually, i.e. df$A, df$B, etc.
> df$A
[1] 1 1
> df$B
[1] 1 1
> df$C
[1] 2 1
And so forth
You could unlist and then split according to the names, something like:
temp <- unlist(mylist)
res <- split(unname(temp), names(temp))
# res$A
# [1] 1 1
# res$B
# [1] 1 1
# res$C
# [1] 2 1
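If you specifically want a data.frame rather than a list (my own addition, not part of the answer above), row-binding the named vectors also works, since every element has the same names:
# rbind the named vectors into a 2 x 8 matrix, then convert to a data.frame
df <- as.data.frame(do.call(rbind, mylist))
df$A
# [1] 1 1
df$C
# [1] 2 1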
I want to create a new column in a dataframe that states the number of observations for a particular group.
I have a surgical procedure (HRG.Code), multiple consultants who perform this procedure (Consultant.Code), and the length of stay of their patients in days (LengthofStayDays).
Using
sourceData2$meanvalue <- with(sourceData2, ave(LengthofStayDays, HRG.Code, Consultant.Code, FUN = mean))
I can get a new column (meanvalue) that shows the mean length of stay per consultant per procedure.
This is just what I need. However, I'd also like a new column in the same data frame giving how many occurrences of each procedure each consultant performed.
How do I generate this number of observations? There doesn't appear to be a FUN = Observations or FUN = freq option.
You may try:
tbl <- table(sourceData2[,3:2]) #gives the frequency of each `procedure` i.e. `HRG.Code` done by every `Consultant.Code`
tbl
# HRG.Code
#Consultant.Code A B C
# A 1 1 0
# B 4 2 1
# C 0 0 1
# D 1 1 1
# E 2 0 0
as.data.frame.matrix(tbl) #converts the `table` to `data.frame`
If you want the total number of unique procedures done by each Consultant.Code in long form:
with(sourceData2, as.numeric(ave(HRG.Code, Consultant.Code,
FUN=function(x) length(unique(x)))))
# [1] 3 3 3 2 1 3 3 3 3 1 1 3 3 3 2
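If instead you want the count of observations for each HRG.Code/Consultant.Code pair as a new column, mirroring the ave() call you already use for the mean (this is my own addition, a sketch rather than part of the answer above), FUN = length does it:
# number of rows in each HRG.Code / Consultant.Code group, repeated for every row
sourceData2$n_obs <- with(sourceData2,
                          ave(LengthofStayDays, HRG.Code, Consultant.Code, FUN = length))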
data
sourceData2 <- structure(list(LengthofStayDays = c(2L, 2L, 4L, 3L, 4L, 5L, 2L,
4L, 5L, 2L, 4L, 2L, 4L, 4L, 2L), HRG.Code = c("C", "A", "A",
"B", "A", "A", "B", "C", "A", "A", "C", "A", "B", "B", "A"),
Consultant.Code = c("B", "B", "B", "A", "E", "B", "D", "D",
"D", "E", "C", "B", "B", "B", "A")), .Names = c("LengthofStayDays",
"HRG.Code", "Consultant.Code"), row.names = c(NA, -15L), class = "data.frame")