I've got a data frame that I read from a file like this:
name, points, wins, losses, margin
joe, 1, 1, 0, 1
bill, 2, 3, 0, 4
joe, 5, 2, 5, -2
cindy, 10, 2, 3, -2.5
etc.
I want to average out the column values across all rows that share the same name. Is there an easy way to do this in R?
For example, I want to get the average column values for all the "joe" rows, coming out with something like
joe, 3, 1.5, 2.5, -.5
After loading your data:
df <- structure(list(name = structure(c(3L, 1L, 3L, 2L), .Label = c("bill", "cindy", "joe"), class = "factor"), points = c(1L, 2L, 5L, 10L), wins = c(1L, 3L, 2L, 2L), losses = c(0L, 0L, 5L, 3L), margin = c(1, 4, -2, -2.5)), .Names = c("name", "points", "wins", "losses", "margin"), class = "data.frame", row.names = c(NA, -4L))
Just use the aggregate function:
> aggregate(. ~ name, data = df, mean)
name points wins losses margin
1 bill 2 3.0 0.0 4.0
2 cindy 10 2.0 3.0 -2.5
3 joe 3 1.5 2.5 -0.5
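For reference, a sketch of the same computation without the formula interface, passing the numeric columns and a named grouping list:
aggregate(df[-1], by = list(name = df$name), FUN = mean)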
Obligatory plyr and reshape solutions:
library(plyr)
ddply(df, "name", function(x) mean(x[-1]))
library(reshape)
cast(melt(df), name ~ ..., mean)
And a data.table solution, for easy syntax and memory efficiency:
library(data.table)
DT <- data.table(df)
DT[,lapply(.SD, mean), by = name]
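If you also want the result ordered by group, a small variation (on the same DT) is keyby, which sorts the output by name and sets it as the key:
DT[, lapply(.SD, mean), keyby = name]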
I have yet another way, which I will show on a different example. Suppose we have a matrix xt:
a b c d
A 1 2 3 4
A 5 6 7 8
A 9 10 11 12
A 13 14 15 16
B 17 18 19 20
B 21 22 23 24
B 25 26 27 28
B 29 30 31 32
C 33 34 35 36
C 37 38 39 40
C 41 42 43 44
C 45 46 47 48
One can compute the means for rows sharing a row name in a few steps:
1. Compute the means using the aggregate function.
2. Make two modifications: aggregate writes the row names as a new (first) column, so you have to set it back as the row names...
3. ...and remove this column by selecting columns 2 through ncol(xa) of the xa object.
xa <- aggregate(xt, by = list(rownames(xt)), FUN = mean)
rownames(xa) <- xa[, 1]
xa <- xa[, 2:ncol(xa)]
After that we get:
a b c d
A 7 8 9 10
B 23 24 25 26
C 39 40 41 42
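For this particular case (a numeric matrix grouped by its row names), a compact base R alternative is rowsum(), which sums the rows per group; dividing by the group sizes then gives the means. A sketch, assuming xt as above:
# group sums divided by group sizes; same result as the aggregate approach
rowsum(xt, group = rownames(xt)) / as.vector(table(rownames(xt)))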
You can simply use functions from the tidyverse to group your data by name and then summarise all remaining columns with a given function (e.g. mean):
library(tidyverse)

df <- tibble(name = c("joe", "bill", "joe", "cindy"),
             points = c(1, 2, 5, 10),
             wins = c(1, 3, 2, 2),
             losses = c(0, 0, 5, 3),
             margin = c(1, 4, -2, -2.5))

df %>% dplyr::group_by(name) %>% dplyr::summarise_all(mean)
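In dplyr 1.0.0 and later, summarise_all() is superseded by across(); an equivalent sketch:
df %>%
  dplyr::group_by(name) %>%
  dplyr::summarise(dplyr::across(everything(), mean))  # everything() covers all non-grouping columns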
Below, I have demonstrated part of my data:
df<-read.table(text=" K G M
12 2345 Gholi
KAM 2345 KAM
Noghl 1990 KAM
Zae 1990 441
12 2345 441
KAM 1990 12
Noghl 1800 12"
,header=TRUE)
I would like to make integer codes for K, G and M, starting with 1. We have 4 groups in K, so they get 1, 2, 3 and 4; G starts at 5, so 5, 6 and 7, as it has three subgroups.
Using the following code:
df = lapply(df, function(x) as.integer(as.factor(x)))
data.frame(Map("+", df, cumsum(c(0, head(sapply(df, max), -1)))))
I get the following table:
K G  M
1 7 10
2 7 11
3 6 11
4 6 9
1 7 9
2 6 8
3 5 8
Now I want to get the following table:
Group C
K,M 1,8
K,M 2,11
K 3
K 4
G 7
G 6
M 10
M 9
For example, the value 12 appears in column K and (twice) in column M, where it is coded 1 and 8 respectively, so it gets Group "K,M" and codes "1,8", and so on.
After converting all the columns to integer codes (via factor) and adding the maximum code of the previous column to the current one, we pivot to 'long' format with pivot_longer, bind in the original column values reshaped to 'long' format, group by the original value column 'origvalue', and paste together the unique elements of the other columns:
library(dplyr)
library(tidyr)
df %>%
  mutate_all(~ as.integer(factor(.))) %>%
  mutate(G = max(K) + G, M = max(G) + M) %>%
  pivot_longer(everything()) %>%
  bind_cols(df %>%
              mutate_all(as.character) %>%
              pivot_longer(everything(), values_to = 'origvalue') %>%
              dplyr::select(-name)) %>%
  group_by(origvalue) %>%
  summarise_at(vars(-group_cols()), ~ toString(unique(.))) %>%
  dplyr::select(Group = name, C = value)
# A tibble: 9 x 2
# Group C
# <chr> <chr>
#1 K, M 1, 8
#2 G 5
#3 G 6
#4 G 7
#5 M 9
#6 M 10
#7 K, M 2, 11
#8 K 3
#9 K 4
data
df <- structure(list(K = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L), .Label = c("12",
"KAM", "Noghl", "Zae"), class = "factor"), G = c(2345L, 2345L,
1990L, 1990L, 2345L, 1990L, 1800L), M = structure(c(3L, 4L, 4L,
2L, 2L, 1L, 1L), .Label = c("12", "441", "Gholi", "KAM"),
class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
Using tidyr >= 1.0.0, one can use tidy selection in the cols argument as follows:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with("DL_TM"),
               names_to = "TM", values_to = "DM_TM") %>%
  pivot_longer(cols = starts_with("DL_CD"),
               names_to = "CD", values_to = "DL_CD") %>%
  na.omit() %>%
  select(-TM, -CD)
However, the above will quickly get cumbersome (repetitive) with many columns. How can one reduce this to a single pivoting step? I have imagined something conceptual like
pivot_longer(cols=starts_with("DL_TM | DL_CD")....), which will obviously not work because tidy selection only works for a single pattern (as far as I know).
Data
df <- structure(list(DL_TM1 = c(16L, 18L, 53L, 59L, 29L, 3L), DL_CD1 = c("AK",
"RB", "RA", "AJ", "RA", "RS"), DL_TM2 = c(5L, 4L, 8L, NA, 1L,
NA), DL_CD2 = c("CN", "AJ", "RB", NA, "AJ", NA), DL_TM3 = c(NA,
NA, 2L, NA, NA, NA), DL_CD3 = c(NA, NA, "AJ", NA, NA, NA), DL_TM4 = c(NA,
NA, NA, NA, NA, NA), DL_CD4 = c(NA, NA, NA, NA, NA, NA), DL_TM5 = c(NA,
NA, NA, NA, NA, NA), DL_CD5 = c(NA, NA, NA, NA, NA, NA), DEP_DELAY_TM = c(21L,
22L, 63L, 59L, 30L, 3L)), class = "data.frame", row.names = c(NA,
-6L))
Expected Output:
Same as the above but with single pivoting.
Based on the response in the comments, the code in the question does not actually produce the desired result; what was wanted is the result that this produces:
df %>%
  pivot_longer(-DEP_DELAY_TM,
               names_to = c(".value", "X"),        # ".value" sends the matched prefix to the output column names
               names_pattern = "(\\D+)(\\d)") %>%  # split each name into its non-digit prefix and trailing digit
  select(-X) %>%
  drop_na
giving:
# A tibble: 11 x 3
DEP_DELAY_TM DL_TM DL_CD
<int> <int> <chr>
1 21 16 AK
2 21 5 CN
3 22 18 RB
4 22 4 AJ
5 63 53 RA
6 63 8 RB
7 63 2 AJ
8 59 59 AJ
9 30 29 RA
10 30 1 AJ
11 3 3 RS
Base R
We can alternatively do this using base R's reshape. First split the column names (except the last column) by their non-digit parts, giving the varying list; then reshape df to long form using that, and finally run na.omit to remove the rows with NAs.
nms1 <- head(names(df), -1)
varying <- split(nms1, gsub("\\d", "", nms1))
na.omit(reshape(df, dir = "long", varying = varying, v.names = names(varying)))
giving:
DEP_DELAY_TM time DL_CD DL_TM id
1.1 21 1 AK 16 1
2.1 22 1 RB 18 2
3.1 63 1 RA 53 3
4.1 59 1 AJ 59 4
5.1 30 1 RA 29 5
6.1 3 1 RS 3 6
1.2 21 2 CN 5 1
2.2 22 2 AJ 4 2
3.2 63 2 RB 8 3
5.2 30 2 AJ 1 5
3.3 63 3 AJ 2 3
We can extract the column groupings ("TM" and "CD" in this case), map over each column group to apply pivot_longer to that group, and then full_join the resulting list elements. Let me know if this covers your real-world use case.
library(purrr)   # for map() and reduce(); dplyr and tidyr as loaded above

suffixes = unique(gsub(".*_(.{2})[0-9]*", "\\1", names(df)))

df.long = suffixes %>%
  map(~ df %>%
        mutate(id = 1:n()) %>%   # ensure unique identification of each original data row
        select(id, DEP_DELAY_TM, starts_with(paste0("DL_", .x))) %>%
        pivot_longer(cols = -c(DEP_DELAY_TM, id),
                     names_to = .x,
                     values_to = paste0(.x, "_value")) %>%
        na.omit() %>%
        select(-matches(paste0("^", .x, "$")))
  ) %>%
  reduce(full_join) %>%
  select(-id)

df.long
DEP_DELAY_TM TM_value CD_value
1 21 16 AK
2 21 16 CN
3 21 5 AK
4 21 5 CN
5 22 18 RB
6 22 18 AJ
7 22 4 RB
8 22 4 AJ
9 63 53 RA
10 63 53 RB
11 63 53 AJ
12 63 8 RA
13 63 8 RB
14 63 8 AJ
15 63 2 RA
16 63 2 RB
17 63 2 AJ
18 59 59 AJ
19 30 29 RA
20 30 29 AJ
21 30 1 RA
22 30 1 AJ
23 3 3 RS
We have some tidy data with treatments (multiple samples and control), time points, and measured values. I want to normalize all the samples by dividing by the corresponding time point in the control variable.
I know how I would do this with each value in its own column, but can't figure out how to use a combination of gather, mutate, summarise, etc. from tidyr or dplyr to do this in a straightforward way.
Here is a sample data frame definition:
structure(list(time = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 100, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -9L), class = "data.frame")
Data frame looks like this:
time value treat
1 10 c
2 20 c
3 15 c
1 100 t1
2 210 t1
3 180 t1
1 110 t2
2 180 t2
3 140 t2
Expected output: the same, but with a normvalue column containing c(1, 1, 1, 10, 10.5, 12, 11, 9, 9.333333).
I'd like to get the normalized values for each treatment and time point using tidyverse procedures...
If you group by time (assuming that, as in the example, it is the grouping variable for time-point) then we can use bracket notation in a mutate statement to search only within the group. We can use that to access the control value for each group and then divide the un-normalized value by that:
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 9 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 1 100 t1 10
5 2 210 t1 10.5
6 3 180 t1 12
7 1 110 t2 11
8 2 180 t2 9
9 3 140 t2 9.33
All this does is take the value column of each row and divide it by the value for the control sample with the same time value. As you can see, it doesn't care if sample t1 is missing an observation for time == 1:
df <- structure(list(time = c(1, 2, 3, 2, 3, 1, 2, 3),
value = c(10, 20, 15, 210, 180, 110, 180, 140),
as.factor.treat. = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
.Label = c("c", "t1", "t2"), class = "factor")),
.Names = c("time", "value", "treat"),
row.names = c(NA, -8L), class = "data.frame")
df %>%
group_by(time) %>%
mutate(value.norm = value / value[treat == 'c'])
# A tibble: 8 x 4
# Groups: time [3]
time value treat value.norm
<dbl> <dbl> <fct> <dbl>
1 1 10 c 1
2 2 20 c 1
3 3 15 c 1
4 2 210 t1 10.5
5 3 180 t1 12
6 1 110 t2 11
7 2 180 t2 9
8 3 140 t2 9.33
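If you prefer an explicit join over indexing within groups, an equivalent sketch (on the same df) pulls out the control rows and joins them back by time:
library(dplyr)

df %>%
  left_join(df %>% filter(treat == "c") %>% select(time, control = value),
            by = "time") %>%
  mutate(value.norm = value / control) %>%
  select(-control)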
I have a dataframe with 2 columns:
.id vals
1 A 10
2 B 20
3 C 30
4 A 100
5 B 200
6 C 300
dput(tst_df)
structure(list(.id = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor"), vals = c(10, 20, 30, 100, 200,
300)), .Names = c(".id", "vals"), row.names = c(NA, -6L), class = "data.frame")
Now I want the .id values to become my column names and the vals to become 2 rows.
Like this:
A B C
10 20 30
100 200 300
Basically .id is my grouping variable, and I want all values belonging to one group gathered under that group's column. I expected something simple like melt and transform, but after many tries I still have not succeeded. Is anyone familiar with a function that will accomplish this?
You can do this in base R with unstack:
unstack(df, form=vals~.id)
A B C
1 10 20 30
2 100 200 300
The first argument is the name of the data.frame and the second is a formula which determines the unstacked structure.
You can also use tapply,
do.call(cbind, tapply(df$vals, df$.id, I))
# A B C
#[1,] 10 20 30
#[2,] 100 200 300
or wrap it in a data frame, i.e.
as.data.frame(do.call(cbind, tapply(df$vals, df$.id, I)))
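For completeness, a tidyr sketch of the same reshaping (using the same df as the answers above): pivot_wider() needs a within-group row index so it knows which values belong in the same output row.
library(dplyr)
library(tidyr)

df %>%
  group_by(.id) %>%
  mutate(row = row_number()) %>%   # 1st, 2nd, ... value within each .id group
  ungroup() %>%
  pivot_wider(names_from = .id, values_from = vals) %>%
  select(-row)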
I am looking for a way to look up information from one dataframe in another dataframe, get a value from that other dataframe, and pass it back to the first one.
example data:
I've got a dataframe named "x"
x <- structure(list(from = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L
), to = c(2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L, 5L, 6L), number = c(30,
30, 30, 33, 34, 35, 36, 37, 38, 39), name = c("region 1", "region 2",
"region 3", "region 4", "region 5", "region 6", "region 7", "region 8",
"region 9", "region 10")), .Names = c("from", "to", "number",
"name"), row.names = c(NA, -10L), class = "data.frame")
# from to number name
#1 1 2 30 region 1
#2 2 3 30 region 2
#3 3 4 30 region 3
#4 4 5 33 region 4
#5 5 6 34 region 5
#6 1 2 35 region 6
#7 2 3 36 region 7
#8 3 4 37 region 8
#9 4 5 38 region 9
#10 5 6 39 region 10
This dataframe holds information about certain regions (1-10)
I've got another dataframe "y"
y <- structure(list(location = c(1.5, 2.8, 10, 3.5, 2), id_number =
c(30, 30, 38, 40, 36)), .Names = c("location", "id_number"), row.names
= c(NA, -5L), class = "data.frame")
# location id_number
#1 1.5 30
#2 2.8 30
#3 10.0 38
#4 3.5 40
#5 2.0 36
This one contains information about locations.
What I need is a function (or command, or whatever I can throw at R ;-) ) that:
for every row in y: checks whether y$location falls between x$from and x$to AND y$id_number == x$number.
If a match is found (a row of y can fall in 1 row of x, or in 0; it is impossible for it to match two rows of x), return x$name to a new column in y, named "name".
desired output:
# location id_number name
#1 1.5 30 region 1
#2 2.8 30 region 2
#3 10.0 38 <NA>
#4 3.5 40 <NA>
#5 2.0 36 region 7
I'm pretty new to R, so my first idea was to use for-loops to tackle this problem (as I'm used to doing in VB). But then I thought: "noooooo", I have to vectorise it, like all the people are telling me good R programmers do ;-)
So I came up with a function, and called it with adply (from the plyr-package).
Problem is: It does not work, throws me an error I don't understand, and now I'm stuck...
Can anyone point me in the right direction?
require("dplyr")
getValue <- function(y, x) {
tmp <- x %>%
filter(from <= y$location, to > y$location, number == y$id_number)
return(tmp$name)
}
y["name"] <- adply(y, 1, getValue, x=x)
Another base method (mostly):
# we need this for the last line - if you don't use magrittr, just wrap the sapply around the lapply
library(magrittr)
# get a list of vectors where each item is whether an item's location in y is ok in each to/from in x
locationok <- lapply(y$location, function(z) z >= x$from & z <= x$to)
# another list of logical vectors indicating whether y's location matches the number in x
idok <- lapply(y$id_number, function(z) z== x$number)
# combine the two list and use the combined vectors as an index on x$name
lapply(1:nrow(y), function(i) {
  x$name[locationok[[i]] & idok[[i]]]
}) %>%
  # replace zero-length results with NA values
  sapply(function(x) ifelse(length(x) == 0, NA, x))
Here's a simple base method that uses the OP's logic:
f <- function(vec, id) {
if(length(.x <- which(vec >= x$from & vec <= x$to & id == x$number))) .x else NA
}
y$name <- x$name[mapply(f, y$location, y$id_number)]
y
# location id_number name
#1 1.5 30 region 1
#2 2.8 30 region 2
#3 10.0 38 <NA>
#4 3.5 40 <NA>
#5 2.0 36 region 7
Since you want to match the id_number and number columns, you can join x and y on those columns and then set name to NA if the location doesn't fall between from and to. Here is a dplyr option:
library(dplyr)
y %>%
  left_join(x, by = c("id_number" = "number")) %>%
  mutate(name = if_else(location >= from & location <= to, as.character(name), NA_character_)) %>%
  select(-from, -to) %>%
  arrange(name) %>%
  distinct(location, id_number, .keep_all = TRUE)
# location id_number name
# 1 1.5 30 region 1
# 2 2.8 30 region 2
# 3 2.0 36 region 7
# 4 10.0 38 <NA>
# 5 3.5 40 <NA>
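With dplyr 1.1.0 or later, the range condition can also be written directly as a non-equi join via join_by(); a sketch on the same x and y:
library(dplyr)  # >= 1.1.0 for join_by() with inequality conditions

y %>%
  left_join(x, by = join_by(id_number == number, location >= from, location <= to)) %>%
  select(location, id_number, name)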