Multiple aggregation with unspecified FUN in R - r

I have a data.frame object in R and need to:
Group by col_1
Select rows from col_3 such that the col_2 value is the second largest one (if there is only one observation for the given value of col_1, return 'NA', for instance).
How can I obtain this?
Example:
scored xg first_goal scored_mane
1 1 1.03212 Lallana 0
2 1 2.06000 Mane 1
3 2 2.38824 Robertson 1
4 2 1.64291 Mane 1
Group by "scored_mane", return values from "scored" where "xg" is the second largest. Expected output: "NA", 1

You can try the following base R solution, using aggregate + merge:
res <- merge(aggregate(xg ~ scored_mane, df, function(v) sort(v, decreasing = TRUE)[2]), df, all.x = TRUE)[, "scored"]
such that
> res
[1] NA 1
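A step-by-step variant of the same idea may be easier to follow. The sketch below splits the data by scored_mane, finds the second-largest xg in each group, and looks up the matching scored value. The helper name second_largest is introduced here for illustration only, the data frame is rebuilt inline from the example, and ties in xg are not handled specially.

```r
# Example data, rebuilt from the question
df <- data.frame(
  scored = c(1L, 1L, 2L, 2L),
  xg = c(1.03212, 2.06, 2.38824, 1.64291),
  first_goal = c("Lallana", "Mane", "Robertson", "Mane"),
  scored_mane = c(0L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: second-largest value, or NA if fewer than two values
second_largest <- function(v) sort(v, decreasing = TRUE)[2]

# For each group, find the row whose xg is the second largest
# and return its scored value (NA when the group has a single row)
res <- sapply(split(df, df$scored_mane), function(g) {
  g$scored[match(second_largest(g$xg), g$xg)]
})
res
```

The result is a vector named by the values of scored_mane, so unname(res) gives the plain values if the names are not wanted.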
DATA
structure(list(scored = c(1L, 1L, 2L, 2L), xg = c(1.03212, 2.06,
2.38824, 1.64291), first_goal = c("Lallana", "Mane", "Robertson",
"Mane"), scored_mane = c(0L, 1L, 1L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4")) -> df

Related

How to order contingency table based on data order?

Given
Group ss
B male
B male
B female
A male
A female
X male
Then
tab <- table(res$Group, res$ss)
I want the Group column to be in the order B, A, X, as it appears in the data. Currently it is in alphabetical order, which is not what I want. This is what I want:
MALE FEMALE
B 5 5
A 5 10
X 10 12
If you arrange the factor levels based on the order you want, you'll get the desired result.
res$Group <- factor(res$Group, levels = c('B', 'A', 'X'))
#If it is based on occurrence in Group column we can use
#res$Group <- factor(res$Group, levels = unique(res$Group))
table(res$Group, res$ss)
#Or just
#table(res)
# female male
# B 1 2
# A 1 1
# X 0 1
data
res <- structure(list(Group = structure(c(2L, 2L, 2L, 1L, 1L, 3L),
.Label = c("A", "B", "X"), class = "factor"), ss = structure(c(2L, 2L, 1L, 2L,
1L, 2L), .Label = c("female", "male"), class = "factor")),
class = "data.frame", row.names = c(NA, -6L))
unique returns the unique elements of a vector in the order they occur. A table can be ordered like any other structure by extracting its elements in the order you want. So if you pass the output of unique to [,] then you'll get the table sorted in the order of occurrence of the vector.
tab <- table(res$Group, res$ss)[unique(res$Group),]
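When Group is a factor, indexing a table with unique(res$Group) uses the factor's integer codes rather than its labels. A slightly more explicit sketch, assuming the same Group/ss data rebuilt here as plain character columns, indexes the rows by name instead:

```r
# Example data from the question, as character columns
res <- data.frame(
  Group = c("B", "B", "B", "A", "A", "X"),
  ss = c("male", "male", "female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

tab <- table(res$Group, res$ss)

# Reorder rows by name, in order of first appearance in the data;
# as.character() makes this safe whether Group is a factor or not
ordered_tab <- tab[as.character(unique(res$Group)), ]
ordered_tab
```

Indexing by name does not depend on how the factor levels happen to be coded.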

Using lapply to group list of data frames by column

I have a list that contains multiple data frames. I would like to sort the data by Category (A) and sum the Frequencies (B) using the lapply-command.
The data is df_list
df_list
$`df.1`
A B
1 Apples 2
2 Pears 5
3 Apples 6
4 Pears 1
5 Apples 3
$`df.2`
A B
1 Oranges 2
2 Pineapples 5
3 Oranges 6
4 Pineapples 1
5 Oranges 3
The desired outcome df_list_2 looks like this:
df_list_2
$`df.1`
A B
1 Apples 11
2 Pears 6
$`df.2`
A B
1 Oranges 11
2 Pineapples 6
I have tried the following code based on lapply:
df_list_2<-df_list[, lapply(B, sum), by = A]
However, I get an error saying that A was not found.
Either I am mistaken about how lapply works in this case, or my understanding of how it should work is flawed.
Any help much appreciated.
You need to aggregate in lapply
lapply(df_list, function(x) aggregate(B~A, x, sum))
#[[1]]
# A B
#1 Apples 11
#2 Pears 6
#[[2]]
# A B
#1 Oranges 11
#2 Pineapples 6
Using map from purrr and dplyr it would be
library(dplyr)
purrr::map(df_list, ~.x %>% group_by(A) %>% summarise(sum = sum(B)))
data
df_list <- list(structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L),
.Label = c("Apples", "Pears"), class = "factor"), B = c(2L, 5L, 6L, 1L, 3L)),
class = "data.frame", row.names = c("1", "2", "3", "4", "5")),
structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L), .Label = c("Oranges",
"Pineapples"), class = "factor"), B = c(2L, 5L, 6L, 1L, 3L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5")))
I fear you might not have a clear idea of lapply nor the extract operator ([). Remember lapply(list, function) applies the specified function you give it to each element of the list you give it. Extract gives you the element you specify:
x <- c('a', 'b', 'c')
x[2]
## "b"
I would imagine that somewhere in your R workspace you have an object named B, which is why you didn't get an error along the lines of
## Error in lapply(B, sum) : object 'B' not found
Conversely if you had (accidentally or intentionally) defined both A and B you would see the error
## Error in df_list[, lapply(B, sum), by = A] : incorrect number of dimensions
because that's not at all how to use [; remember, you just pass indexes or booleans to [ along with the occasional optional argument, but by is not one of those.
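To make the distinction concrete, here is a small sketch with a made-up list (the names nums, totals, sub_list and elem are illustrative, not from the question). lapply runs a function on every element and returns a list, while [ and [[ only pick elements out:

```r
# A toy list with two numeric vectors
nums <- list(a = 1:3, b = 4:6)

# lapply applies sum() to each element and returns a list of results
totals <- lapply(nums, sum)

# [ returns a sub-list; [[ returns the element itself
sub_list <- nums["a"]
elem <- nums[["a"]]

str(totals)
```

Neither [ nor [[ takes a by argument on a plain list, which is why the grouping has to happen inside the function passed to lapply.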
So without further ado, here's how I would do this (in base R):
# make some data
a <- c(1, 2, 1, 2, 1)
b <- c(2, 5, 6, 1, 3)
df_list <- list(df.1 = data.frame(A = c('Apples', 'Pears')[a], B = b),
df.2 = data.frame(A = c('Oranges', 'Pineapples')[a], B = b))
# simplify it
df_list_2 <- lapply(df_list, function(x) {
aggregate(list(B = x$B), list(A = x$A), sum)
})
# the desired result
df_list_2
## $df.1
## A B
## 1 Apples 11
## 2 Pears 6
##
## $df.2
## A B
## 1 Oranges 11
## 2 Pineapples 6
You can take advantage of the fact that a data.frame is just a list and shorten up your code like this:
df_list_2 <- lapply(df_list, function(x) {
aggregate(x['B'], x['A'], sum)
})
but the first way of writing it should make it clearer what we're doing.
The data.table syntax in the OP's post can be changed to
library(data.table)
lapply(df_list, function(x) as.data.table(x)[, .(B = sum(B)), by = A])
#$df.1
# A B
#1: Apples 11
#2: Pears 6
#$df.2
# A B
#1: Oranges 11
#2: Pineapples 6
data
df_list <- list(df.1 = structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L
), .Label = c("Apples", "Pears"), class = "factor"), B = c(2L,
5L, 6L, 1L, 3L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5")), df.2 = structure(list(A = structure(c(1L, 2L,
1L, 2L, 1L), .Label = c("Oranges", "Pineapples"), class = "factor"),
B = c(2L, 5L, 6L, 1L, 3L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")))

How to change the decimals in my lsmeans output

Forgive my lack of R knowledge. I am running some statistics, but I have a problem with the number of decimals in the output. The table I use is simple, including 2 columns of 'text' and 2 'numeric' columns. The table shows 5 digits (3 decimals). However, when working with this table, RStudio only shows 1 decimal, not only in my lsmeans results but already in head(X).
I already tried the following (where X is data):
>format(X, digits=5)
>format(X, decimals=3)
>print(lsmeans,decimals=3)
>options(digits = 5)
However the columns N and Dm are still rounded to 1 decimal.
> N 92.4 92.4 93.7 .....
> Dm 44.8 51.2 49.0 ....
> lsmean 92.7 93.3 92.2
I would like to see the columns N and Dm with 3 decimals (as I see them in the table when I use View(X)), and likewise the lsmean results for N.
Example data:
X <- structure(list(Diet = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
.Label = c("1",
"2", "3", "4"),
class = c("ordered", "factor")),
Room = structure(c(1L,
1L, 1L, 1L, 1L, 2L),
.Label = c("1", "2"), class = c("ordered",
"factor")),
Ndigestibility = c(92.3961026914675, 91.3131265857907,
93.7094576131358, 93.1557358031795,
91.6853770290382, 93.2698082975574),
Dmdigestibility = c(44.7692224966736, 51.2173172537712,
49.0100980168149, 45.6289084300095,
45.9036710781654, 45.3144774487225)),
row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
X
# Diet Room Ndigestibility Dmdigestibility
# 1 1 1 92.39610 44.76922
# 2 1 1 91.31313 51.21732
# 3 1 1 93.70946 49.01010
# 4 1 1 93.15574 45.62891
# 5 1 1 91.68538 45.90367
# 6 1 2 93.26981 45.31448
You can do
emm_options(opt.digits = FALSE)
This will disable the feature in the emmeans package (for which lsmeans is a front end) whereby results are displayed in reasonable precision relative to their standard errors.
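The option above concerns how emmeans/lsmeans results are displayed. For the raw columns, one base R possibility is to format the values explicitly. The sketch below assumes the rounding seen in head(X) is purely a display matter (the stored values keep full precision), which the question does not confirm, and it uses a two-row stand-in for X rather than the full data:

```r
# Two-row stand-in for the question's data (assumption: plain data.frame)
X <- data.frame(
  Ndigestibility = c(92.3961026914675, 91.3131265857907),
  Dmdigestibility = c(44.7692224966736, 51.2173172537712)
)

# Round to 3 decimals and keep trailing zeros when converting to text
n_shown <- format(round(X$Ndigestibility, 3), nsmall = 3)
dm_shown <- format(round(X$Dmdigestibility, 3), nsmall = 3)

n_shown
dm_shown
```

Note that format() returns character strings, so this is suitable for display or reporting, not for further arithmetic.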

r creating an adjacency matrix from columns in a dataframe

I am interested in testing some network visualization techniques but before trying those functions I want to build an adjacency matrix (from, to) using the dataframe which is as follows.
Id Gender Col_Cold_1 Col_Cold_2 Col_Cold_3 Col_Hot_1 Col_Hot_2 Col_Hot_3
10 F pain sleep NA infection medication walking
14 F Bump NA muscle NA twitching flutter
17 M pain hemoloma Callus infection
18 F muscle pain twitching medication
My goal is to create an adjacency matrix as follows
1) All values in columns with keyword Cold will contribute to the rows
2) All values in columns with keyword Hot will contribute to the columns
For example, pain, sleep, Bump, muscle, hemaloma are cell values under the columns with keyword Cold and they will form the rows and cell values such as infection, medication, Callus, walking, twitching, flutter are under columns with keywords Hot and this will form the columns of the association matrix.
The final desired output should appear like this:
infection medication walking twitching flutter Callus
pain 2 2 1 1 1
sleep 1 1 1
Bump 1 1
muscle 1 1
hemaloma 1 1
[pain, infection] = 2 because the association between pain and infection occurs twice in the original dataframe: once in row 1 and again in row 3.
[pain, medication] = 2 because the association between pain and medication occurs twice: once in row 1 and again in row 4.
Any suggestions or advice on producing such an association matrix would be much appreciated. Thanks.
Reproducible Dataset
df = structure(list(id = c(10, 14, 17, 18), Gender = structure(c(1L, 1L, 2L, 1L), .Label = c("F", "M"), class = "factor"), Col_Cold_1 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "Bump", "muscle", "pain"), class = "factor"), Col_Cold_2 = structure(c(4L, 2L, 3L, 1L), .Label = c("", "NA", "pain", "sleep"), class = "factor"), Col_Cold_3 = structure(c(1L, 3L, 2L, 4L), .Label = c("NA", "hemaloma", "muscle", "pain" ), class = "factor"), Col_Hot_1 = structure(c(4L, 3L, 2L, 1L), .Label = c("", "Callus", "NA", "infection"), class = "factor"), Col_Hot_2 = structure(c(2L, 3L, 1L, 3L), .Label = c("infection", "medication", "twitching"), class = "factor"), Col_Hot_3 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "flutter", "medication", "walking" ), class = "factor")), .Names = c("id", "Gender", "Col_Cold_1", "Col_Cold_2", "Col_Cold_3", "Col_Hot_1", "Col_Hot_2", "Col_Hot_3" ), row.names = c(NA, -4L), class = "data.frame")
One way is to make the dataset into a "tidy" form, then use xtabs. First, some cleaning up:
df[] <- lapply(df, as.character) # Convert factors to characters
df[df == "NA" | df == "" | is.na(df)] <- NA # Make all blanks NAs
Now, tidy the dataset:
library(tidyr)
library(dplyr)
out <- do.call(rbind, sapply(grep("^Col_Cold", names(df), value = TRUE), function(x){
vars <- c(x, grep("^Col_Hot", names(df), value = TRUE))
setNames(gather_(select(df, one_of(vars)),
key_col = x,
value_col = "value",
gather_cols = vars[-1])[, c(1, 3)], c("cold", "hot"))
}, simplify = FALSE))
The idea is to "pair" each of the "cold" columns with each of the "hot" columns to make a long dataset. out looks like this:
out
# cold hot
# 1 pain infection
# 2 Bump <NA>
# 3 <NA> Callus
# 4 muscle <NA>
# 5 pain medication
# ...
Finally, use xtabs to make the desired output:
xtabs(~ cold + hot, na.omit(out))
# hot
# cold Callus flutter infection medication twitching walking
# Bump 0 1 0 0 1 0
# hemaloma 1 0 1 0 0 0
# muscle 0 1 0 1 2 0
# pain 1 0 2 2 1 1
# sleep 0 0 1 1 0 1
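The same pairing idea can also be sketched in base R only, without the tidyr/dplyr helpers. This is an alternative illustration, not the answer's method: for each row it collects the non-missing "cold" and "hot" values, forms every cold/hot combination with expand.grid, and tabulates the result with xtabs. The data frame below is rebuilt from the question with blanks kept as "" and the "NA" strings entered as real NA; the names cold_cols, hot_cols, pairs and adj are chosen here for illustration.

```r
# Cold/Hot columns from the question, as character vectors
df <- data.frame(
  Col_Cold_1 = c("pain", "Bump", "", "muscle"),
  Col_Cold_2 = c("sleep", NA, "pain", ""),
  Col_Cold_3 = c(NA, "muscle", "hemaloma", "pain"),
  Col_Hot_1 = c("infection", NA, "Callus", ""),
  Col_Hot_2 = c("medication", "twitching", "infection", "twitching"),
  Col_Hot_3 = c("walking", "flutter", "", "medication"),
  stringsAsFactors = FALSE
)

cold_cols <- grep("^Col_Cold", names(df), value = TRUE)
hot_cols <- grep("^Col_Hot", names(df), value = TRUE)

# For each row, pair every non-missing cold value with every hot value
pairs <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  cold <- unlist(df[i, cold_cols], use.names = FALSE)
  hot <- unlist(df[i, hot_cols], use.names = FALSE)
  cold <- cold[!is.na(cold) & cold != ""]
  hot <- hot[!is.na(hot) & hot != ""]
  expand.grid(cold = cold, hot = hot, stringsAsFactors = FALSE)
}))

# Count how often each cold/hot pair occurs
adj <- xtabs(~ cold + hot, pairs)
adj
```

Individual cells can then be read by name, for example adj["pain", "infection"].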

replace column values in data.frame with character string depending on column value, in R

I have a data.frame 'data', where one column contains integer values between 1:100, which are coded values for the Isolate they represent.
Here's my example data, 'data':
Size Isolate spin
1 primary 3 up
2 primary 4 down
3 sec 6 strange
4 ter 1 charm
5 sec 3 bottom
6 quart 2 top
I have another data.frame that contains the key between the integers and the name of the Isolate
1 alpha
2 bravo
3 charlie
4 delta
5 echo
6 foxtrot
7 golf
This list is 100 Isolates in length, too much to type in by hand with if/else.
I'd like to know an easy solution to replacing the integers in my first data.frame, which aren't in ascending order as you can see, with the corresponding Isolate names in the second data.frame.
I tried, after researching:
data$Isolate <- as.numeric(factor(data$Isolate,
levels =c("alpha","bravo","charlie","delta","echo","foxtrot","golf")
)
)
but this just replaced the Isolate column with NA.
As Hubert said in the comments, this is a simple use-case for merge.
Let's say the column names of your second "key" data frame are "Isolate" and "Isolate_Name", then it's as easy as
merge(data, key_data, by = "Isolate")
The default is an "inner join", which only keeps records that have matches. If you're worried about losing records that don't have matches, you can add the argument all.x = TRUE.
If you prefer non-base packages, this is easy in data.table or dplyr as well.
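As a minimal sketch of the merge approach, assuming the key data frame is called key_data with columns Isolate and Isolate_Name (names taken from the description above, data rebuilt from the question):

```r
# Example data from the question
data <- data.frame(
  Size = c("primary", "primary", "sec", "ter", "sec", "quart"),
  Isolate = c(3L, 4L, 6L, 1L, 3L, 2L),
  spin = c("up", "down", "strange", "charm", "bottom", "top"),
  stringsAsFactors = FALSE
)

# Assumed lookup table: integer code -> isolate name
key_data <- data.frame(
  Isolate = 1:7,
  Isolate_Name = c("alpha", "bravo", "charlie", "delta",
                   "echo", "foxtrot", "golf"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of data, even without a match in key_data
merged <- merge(data, key_data, by = "Isolate", all.x = TRUE)

# merge() sorts by the key; match() keeps the original row order instead
in_order <- key_data$Isolate_Name[match(data$Isolate, key_data$Isolate)]

merged
in_order
```

If the original row order matters, the match() lookup may be the simpler choice, since merge() returns rows sorted by the key column by default.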
Using factor, you could try:
data$Isolate <- factor(data$Isolate,
levels=1:7,
labels =c("alpha","bravo","charlie","delta","echo","foxtrot","golf"))
If you have many levels that are already in their own data.frame, you could automate this.
data$Isolate <- factor(data$Isolate,levels=code$No,labels=code$Value)
With your second data.frame, code:
code <- read.table(text="1 alpha
2 bravo
3 charlie
4 delta
5 echo
6 foxtrot
7 golf",stringsAsFactors=FALSE)
names(code) <- c("No","Value")
df$Isolate <- df2[,1][df$Isolate]
# Size Isolate spin
# 1 primary charlie up
# 2 primary delta down
# 3 sec foxtrot strange
# 4 ter alpha charm
# 5 sec charlie bottom
# 6 quart bravo top
You can subset the lookup data frame by the target data frame.
Data
df <- structure(list(Size = structure(c(1L, 1L, 3L, 4L, 3L, 2L), .Label = c("primary",
"quart", "sec", "ter"), class = "factor"), Isolate = c(3L, 4L,
6L, 1L, 3L, 2L), spin = structure(c(6L, 3L, 4L, 2L, 1L, 5L), .Label = c("bottom",
"charm", "down", "strange", "top", "up"), class = "factor")), .Names = c("Size",
"Isolate", "spin"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
df2 <- structure(list(V2 = structure(1:7, .Label = c("alpha", "bravo",
"charlie", "delta", "echo", "foxtrot", "golf"), class = "factor")), .Names = "V2", class = "data.frame", row.names = c(NA,
-7L))
