How do the table function and $Freq work in R - r

I want a function for the mode of a vector. Abhiroop Sarkar's answer to this question works, but I want to understand why.
Here is the code:
Mode <- function(x){
y <- data.frame(table(x))
y[y$Freq == max(y$Freq),1]
}
1) Why do we need to put the table in a data frame?
2) In this line
y[y$Freq == max(y$Freq),1]
what does y$Freq do? Is Freq a default column in the table?

When we convert a table output to data.frame, it creates a two column data.frame
set.seed(24)
v1 <- table(sample(1:5, 100, replace = TRUE))
y <- data.frame(v1)
y
# Var1 Freq
#1 1 19
#2 2 24
#3 3 22
#4 4 16
#5 5 19
The first column, 'Var1', holds the distinct values (the names of the table output), and 'Freq' holds the count of each of those values
y[y$Freq == max(y$Freq), 1]
#[1] 2
#Levels: 1 2 3 4 5
Here we subset the first column, 'Var1', to the rows where 'Freq' equals its maximum. The result is a vector (here a factor) because [ uses drop = TRUE by default when a single column is selected
If we want to return a data.frame with a single column, add drop = FALSE at the end
y[y$Freq == max(y$Freq), 1, drop = FALSE]
# Var1
#2 2
Regarding the default name Freq, it is created from the as.data.frame.table method
as.data.frame.table
function (x, row.names = NULL, ..., responseName = "Freq", stringsAsFactors = TRUE,
sep = "", base = list(LETTERS))
{
ex <- quote(data.frame(do.call("expand.grid", c(dimnames(provideDimnames(x,
sep = sep, base = base)), KEEP.OUT.ATTRS = FALSE, stringsAsFactors = stringsAsFactors)),
Freq = c(x), row.names = row.names))
names(ex)[3L] <- responseName
eval(ex)
}
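As a rough illustration of the points above, the data frame step is a convenience rather than a requirement: the counts and their names can be read straight from the table object. The function name mode_from_table and the sample vector below are made up for this sketch; responseName is the argument shown in the as.data.frame.table signature above.

```r
# Sketch: compute the mode directly from the table, without a data frame.
# mode_from_table is a hypothetical name used only for this example.
mode_from_table <- function(x) {
  counts <- table(x)
  # names(counts) holds the distinct values as character strings
  names(counts)[counts == max(counts)]
}

x <- c(1, 2, 2, 3, 3, 3)
mode_value <- mode_from_table(x)
mode_value

# The count column is only called "Freq" by default; responseName changes it
counts_df <- as.data.frame(table(x), responseName = "n")
names(counts_df)
```

Note that this version returns the mode as a character string, whereas the data frame version returns a factor.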

Related

Catching an error based on unique id values in a data.frame

I want to create a stop() for my data.frame below such that for each unique id, if pos is varying (e.g., consisting of 1s, 2s,etc.), then, if mp values for the rows for which cont==TRUE are not the same, we should throw an error.
Is this possible in R?
In the below toy example, id == "B" should throw an error because, pos is varying (1,2,3), and mp values (i.e., 1,3) for the rows for which cont==TRUE are not the same.
dat <- data.frame(id = rep(c("A","B"),2:3), mp = c(1,5, 2,1,3), cont = c(F,T, F,T,T), pos = c(1,1, 1:3))
# id mp cont pos
#1 A 1 FALSE 1
#2 A 5 TRUE 1
#3 B 2 FALSE 1
#4 B 1 TRUE 2
#5 B 3 TRUE 3
# Desired stop() message:
"Error: 'B' has a wrong value."
In base R, one option is to split the subset of the data where 'cont' is TRUE by 'id' into a list. Then loop over the list together with its names in Map, and call stop if a list element has more than one unique row
lst1 <- split(dat[dat$cont,c("mp", "pos")], dat$id[dat$cont])
Map(function(x, y) if(nrow(unique(x)) > 1)
stop(sprintf("'%s' has a wrong value.", y), call. = FALSE),
lst1, names(lst1))
#Error: 'B' has a wrong value.
With the updated example
lst1 <- split(dat[dat$control, c("mpre", "post")], dat$study.name[dat$control])
Map(function(x, y) if(all(lengths(lapply(x, unique)) > 1))
stop(dQuote(sprintf("'%s' has a wrong value.", y), FALSE), call. = FALSE),
lst1, names(lst1))
#Error: "'Brown' has a wrong value."
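For comparison, here is a sketch that spells out the two conditions from the question separately, using the original toy data (columns id, mp, cont, pos). The helper name check_ids is hypothetical, and the tryCatch wrapper is only there so the message can be inspected without halting the script.

```r
# Toy data from the question
dat <- data.frame(id = rep(c("A", "B"), 2:3), mp = c(1, 5, 2, 1, 3),
                  cont = c(FALSE, TRUE, FALSE, TRUE, TRUE), pos = c(1, 1, 1:3))

# check_ids is a hypothetical helper: an id is flagged when pos varies
# AND the mp values of its cont == TRUE rows are not all the same.
check_ids <- function(d) {
  bad <- vapply(split(d, d$id), function(g) {
    length(unique(g$pos)) > 1 && length(unique(g$mp[g$cont])) > 1
  }, logical(1))
  if (any(bad)) {
    stop(sprintf("'%s' has a wrong value.",
                 paste(names(bad)[bad], collapse = "', '")), call. = FALSE)
  }
  invisible(TRUE)
}

# Capture the error message instead of stopping
msg <- tryCatch(check_ids(dat), error = function(e) conditionMessage(e))
msg
```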

If else ladder not working in R

I have this in my dataframe after reading and rearranging multiple csv files. Basically I want an if else ladder to refer to the ID column and if it matches a number from the list of concatenates then place a word in a new "group" column
# of int. int. not.int. ID
1 50 218.41 372.16 1
3 33 134.94 158.17 3
I then made these concatenates to refer to.
veh = as.character(c('1', '5'))
thc1 = as.character(c('2', '6'))
thc2 = as.character(c('3', '7'))
thc3 = as.character(c('4', '8'))
Then I made an if else ladder to list the corresponding values.
social.dat$group = if (social.dat$ID == veh) {
social.dat$group == "veh"
} else if (social.dat$group == thc1) {
social.dat$group == "thc1"
} else if (social.dat$group == thc2) {
social.dat$group == "thc2"
} else {
social.dat$group == "thc3"
}
However, I then get this warning message.
Warning message:
In if (social.dat$ID == veh) { :
the condition has length > 1 and only the first element will be used
I have looked up this warning message in multiple variations and have not found anything that really helped. Any help would be much appreciated, and alternative options would be welcome as well. I apologize in advance if there is already a solution on Stack Overflow that I missed.
EDIT:
I tried using
social.dat$group = ifelse(social.dat$ID == veh, "veh", "thc")
social.dat$group = ifelse(social.dat$ID == thc, "thc", "veh")
but it changed the output of the dataframe after each line.
Here is the full code I am using to rearrange the csv files and get the dataframe that I first mentioned above.
#calls packages
library(tidyr)
library( plyr )
library(reshape2)
#make sure to change your working directory to where the files are kept
setwd("C:/Users/callej03/Desktop/test")
wd = "C:/Users/callej03/Desktop/test"
files = list.files(path=wd, pattern="*.csv", full.names=TRUE,
recursive=FALSE)
################################################################
#this function creates a list of the number of interactions for each file in the folder
lap.list = lapply(files, function(x) {
dat = read.csv(x, header= TRUE)
dat = dat[-c(1),]
dat = as.data.frame(dat)
dat = separate(data = dat, col = dat, into = c("lap", "duration"), sep = "\\ ")
dat$count = 1:nrow(dat)
y = dat$count
i= y%%2==0
dat$interacting = i
int = dat[which(dat$interacting == TRUE),]
interactions = sum(int$interacting)
})
#########################################################################
#this changes the row name to the name of the file - i.e. the rat ID
lap.list = as.data.frame(lap.list)
lap.list = t(lap.list)
colnames(lap.list) = c("# of int.")
row.names(lap.list) = sub(wd, "", files)
row.names(lap.list) = gsub("([0-9]+).*$", "\\1", rownames(lap.list))
row.names(lap.list) = gsub('/', "", row.names(lap.list), fixed = TRUE)
###########################################################################
#this applies almost the same function as the one listed above except I call it a different vector name so it can be manipulated
int.duration = lapply(files, function(x) {
dat2 = read.csv(x, header= TRUE)
dat2 = dat2[-c(1),]
dat2 = as.data.frame(dat2)
dat2 = separate(data = dat2, col = dat2, into = c("lap", "duration"), sep =
"\\ ")
dat2$count = 1:nrow(dat2)
y = dat2$count
i= y%%2==0
dat2$interacting = i
int = dat2[which(dat2$interacting == TRUE),]
})
noint.duration = lapply(files, function(x) {
dat2 = read.csv(x, header= TRUE)
dat2 = dat2[-c(1),]
dat2 = as.data.frame(dat2)
dat2 = separate(data = dat2, col = dat2, into = c("lap", "duration"), sep =
"\\ ")
dat2$count = 1:nrow(dat2)
y = dat2$count
i= y%%2==0
dat2$interacting = i
noint = dat2[which(dat2$interacting == FALSE),]
})
###################################################################
#this splits the output time of minutes, seconds, and milliseconds.
#then it combines them into a total seconds.milliseconds readout.
#after that, it takes the sum of the times for each file and combines them with the total interactions dataframe.
int.duration = melt(int.duration)
int.duration = as.data.frame(int.duration)
int.left = as.data.frame(substr(int.duration$duration, 1, 2))
colnames(int.left) = "min"
int.mid = as.data.frame(substr(int.duration$duration, 4, 4 + 2 - 1))
colnames(int.mid) = "sec"
int.right = as.data.frame(substr(int.duration$duration,
nchar(int.duration$duration) - (2-1), nchar(int.duration$duration)))
colnames(int.right) = "ms"
int.time = cbind(int.left, int.mid, int.right)
int.time$min = as.numeric(as.character(int.time$min))
int.time$sec = as.numeric(as.character(int.time$sec))
int.time$ms = as.numeric(as.character(int.time$ms))
int.time$ms = int.time$ms/100
int.time$min = ifelse(int.time$min > 0, int.time$min*60,0)
int.time$sum = rowSums(int.time)
int.file = as.data.frame(int.duration$L1)
colnames(int.file) = "file"
int.time = cbind(int.time, int.file)
int.tot = as.data.frame(tapply(int.time$sum, int.time$file, sum))
colnames(int.tot) = "int."
social.dat = cbind(lap.list, int.tot)
noint.duration = melt(noint.duration)
noint.duration = as.data.frame(noint.duration)
noint.left = as.data.frame(substr(noint.duration$duration, 1, 2))
colnames(noint.left) = "min"
noint.mid = as.data.frame(substr(noint.duration$duration, 4, 4 + 2 - 1))
colnames(noint.mid) = "sec"
noint.right = as.data.frame(substr(noint.duration$duration,
nchar(noint.duration$duration) - (2-1), nchar(noint.duration$duration)))
colnames(noint.right) = "ms"
noint.time = cbind(noint.left, noint.mid, noint.right)
noint.time$min = as.numeric(as.character(noint.time$min))
noint.time$sec = as.numeric(as.character(noint.time$sec))
noint.time$ms = as.numeric(as.character(noint.time$ms))
noint.time$ms = noint.time$ms/100
noint.time$min = ifelse(noint.time$min > 0, noint.time$min*60,0)
noint.time$sum = rowSums(noint.time)
noint.file = as.data.frame(noint.duration$L1)
colnames(noint.file) = "file"
noint.time = cbind(noint.time, noint.file)
noint.tot = as.data.frame(tapply(noint.time$sum, noint.time$file, sum))
colnames(noint.tot) = "not.int."
social.dat = cbind(social.dat, noint.tot)
social.dat$ID = rownames(social.dat)
Here is an example of a csv file I am working with. The words are all in the same column and separated by spaces.
Total time 10:00.61
Lap times
01 00:07.46
02 00:05.64
03 00:01.07
04 00:01.04
05 00:04.71
06 00:06.43
07 00:12.52
08 00:07.34
09 00:05.46
10 00:05.81
11 00:05.52
12 00:06.51
13 00:10.75
14 00:00.83
15 00:03.64
16 00:02.75
17 00:01.20
18 00:06.17
19 00:04.40
20 00:00.75
21 00:00.84
22 00:01.29
23 00:02.31
24 00:03.04
25 00:02.85
26 00:05.86
27 00:05.76
28 00:05.06
29 00:00.96
30 00:06.91
@akrun suggested ifelse, which works great for one or two nestings. Beyond that, my personal preference is to use dplyr::case_when or a separate data.frame in a merge/join of sorts.
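For reference, a nested ifelse version might look like the sketch below. It assumes the lookup vectors from the question (veh, thc1, thc2) and a made-up ID vector. It uses %in% rather than ==, because == compares element by element with recycling instead of testing membership, and ifelse rather than if, because if expects a single TRUE/FALSE condition (which is what the warning in the question is about).

```r
# Lookup vectors as defined in the question
veh  <- c("1", "5")
thc1 <- c("2", "6")
thc2 <- c("3", "7")

# Hypothetical ID values, for illustration only
ID <- c("1", "3", "6", "8")

# Vectorised ladder: each ifelse handles one rung, the last value is the default
group <- ifelse(ID %in% veh, "veh",
         ifelse(ID %in% thc1, "thc1",
         ifelse(ID %in% thc2, "thc2", "thc3")))
group
```

With four groups this is still readable, but each extra group adds another level of nesting.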
If you are using the "simple case" of assigning consistently by the same fields (id in this case), then the merge/join is my preferred method: it makes maintenance much simpler (IMO). (When I say "consistently by the same fields", I mean that you could have id1 and id2 fields by which you define the individual records and their applicable groups. Likely too much for your example, so I'll keep this answer at one key merging.)
Three methods (data far below):
Base R
dat2a <- merge(dat, groups, by="id", all.x=TRUE)
dat2a
# id int group
# 1 1 22 veh
# 2 2 33 thc1
# 3 3 44 <NA>
Note that any id not included in the definition of groups will have NA group. You can assign a default group with this:
dat2a$group[is.na(dat2a$group)] <- "somedefaultgroup"
dat2a
# id int group
# 1 1 22 veh
# 2 2 33 thc1
# 3 3 44 somedefaultgroup
dplyr, merge/join
Similar concept, but using dplyr-esque verbs.
library(dplyr)
dat2c <- left_join(dat, groups, by="id") %>%
mutate(group = if_else(is.na(group), "somedefaultgroup", group))
dplyr::case_when
(This does not use groups as I defined for the merge/join cases.)
In case you really want to do some ladder/nesting of if/else-like statements, case_when is easier to read (and debug) and might be faster, depending on your use-case.
Most direct:
library(dplyr)
dat2b <- dat
dat2b$group <- case_when(
dat2b$id %in% c("1","5") ~ "veh",
dat2b$id %in% c("2","6") ~ "thc1",
TRUE ~ "somedefaultgroup"
)
A little easier to read than the previous by using with(...), but functionally identical. (If your "ladder" is much longer, then code-golf (number of characters in the code) can be significantly reduced.)
dat2b <- dat
dat2b$group <- with(dat2b, case_when(
id %in% c("1","5") ~ "veh",
id %in% c("2","6") ~ "thc1",
TRUE ~ "somedefaultgroup"
))
If you want to use some dplyr verbs, then:
dat2b <- dat
dat2b <- dat2b %>%
mutate(
group = case_when(
id %in% c("1","5") ~ "veh",
id %in% c("2","6") ~ "thc1",
TRUE ~ "somedefaultgroup"
)
)
Data
When doing merge/join actions, it's important to use stringsAsFactors=FALSE so that the absence of factor levels (of the newly-assigned groups) is not a problem. (This can be worked around, but ...)
dat <- data.frame(id=c("1","2","3"), int=c(22L,33L,44L),
stringsAsFactors=FALSE)
Optional use for the merge examples above:
groups <- data.frame(id=c("1","5","2","6"), group=c("veh","veh","thc1","thc1"),
stringsAsFactors=FALSE)
groups
# id group
# 1 1 veh
# 2 5 veh
# 3 2 thc1
# 4 6 thc1
The premise is that you define one row for each unique id.
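If the group definitions already exist as separate vectors, as in the question, the lookup table can be built from them rather than typed by hand. This is only a sketch: the list name group_ids is made up, and match is used here as a base R stand-in for the merge shown above.

```r
# Group definitions, one vector of ids per group (values from the question)
group_ids <- list(veh = c("1", "5"), thc1 = c("2", "6"),
                  thc2 = c("3", "7"), thc3 = c("4", "8"))

# One row per id: flatten the ids and repeat each group name to match
groups <- data.frame(id = unlist(group_ids, use.names = FALSE),
                     group = rep(names(group_ids), lengths(group_ids)),
                     stringsAsFactors = FALSE)

dat <- data.frame(id = c("1", "2", "3"), int = c(22L, 33L, 44L),
                  stringsAsFactors = FALSE)

# match() returns the row of groups for each dat$id (NA if the id is absent)
dat$group <- groups$group[match(dat$id, groups$id)]
dat
```

Unlike merge, this keeps the original row order of dat.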
Thanks to @r2evans, the following code worked exactly as I wanted it to (using dplyr::case_when)
social.dat$group = case_when(
social.dat$ID %in% c("1","5") ~ "veh",
social.dat$ID %in% c("2","6") ~ "thc1",
social.dat$ID %in% c("3","7") ~ "thc2",
social.dat$ID %in% c("4","8") ~ "thc3"
)
This was the final output of the dataframe
# of int. int. not.int. ID group
1 50 218.41 372.16 1 veh
3 33 134.94 158.17 3 thc2

Combining values Boolean columns to one with Priority in R

I have gone through the links below, but they only partially solved my problem.
merge multiple TRUE/FALSE columns into one
Combining a matrix of TRUE/FALSE into one
R: Converting multiple boolean columns to single factor column
I have a dataframe which looks like:
dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
A = c('Y','N','N','N','N','N','N','N'),
B = c('N','Y','N','N','N','N','Y','N'),
C = c('N','N','Y','N','N','Y','N','N'),
D = c('N','N','N','Y','N','Y','N','N'),
E = c('N','N','N','N','Y','N','Y','N')
)
I want to reshape my df into one column, but it has to apply a priority when there are two "Y" values in a row.
The priority is A>B>C>D>E, which means that if there is a "Y" in A, the resultant value should be A. Similarly, in the example df above, Id 6 has "Y" in both C and D, so the result should be "C".
Hence output should look like:
resultant_dat <- data.frame(Id = c(1,2,3,4,5,6,7,8),
Result = c('A','B','C','D','E','C','B','NA')
)
I have tried this:
library(reshape2)
new_df <- melt(dat, "Id", variable.name = "Result")
new_df <-new_df[new_df$value == "Y", c("Id", "Result")]
But the problem is that it doesn't handle the priority; it creates 2 rows for the same Id.
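# Column names in priority order (assumed from the stated priority A > B > C > D > E)
col_order = c("A", "B", "C", "D", "E")
tmp = data.frame(ID = dat[,1],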
Result = col_order[apply(
X = dat[col_order],
MARGIN = 1,
FUN = function(x) which(x == "Y")[1])],
stringsAsFactors = FALSE)
tmp$Result[is.na(tmp$Result)] = "Not Present"
tmp
# ID Result
#1 1 A
#2 2 B
#3 3 C
#4 4 D
#5 5 E
#6 6 C
#7 7 B
#8 8 Not Present
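A vectorised variant of the same idea, sketched with base R's max.col: with ties.method = "first" it returns the leftmost column holding the row maximum, so if the columns are arranged in priority order it picks the highest-priority "Y". Rows without any "Y" have to be handled separately, since max.col would otherwise report column 1 for them. The names col_order, m and result are chosen for this sketch.

```r
dat <- data.frame(Id = 1:8,
                  A = c('Y','N','N','N','N','N','N','N'),
                  B = c('N','Y','N','N','N','N','Y','N'),
                  C = c('N','N','Y','N','N','Y','N','N'),
                  D = c('N','N','N','Y','N','Y','N','N'),
                  E = c('N','N','N','N','Y','N','Y','N'))

col_order <- c("A", "B", "C", "D", "E")   # priority, highest first
m <- dat[col_order] == "Y"                # logical matrix, one column per flag

# Leftmost TRUE in each row (TRUE counts as 1, FALSE as 0)
result <- colnames(m)[max.col(m, ties.method = "first")]
# Rows with no "Y" at all get NA instead of the first column
result[rowSums(m) == 0] <- NA
result
```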

Sum observations from two columns, looping over many columns in R

I have searched high and low, but am stuck on how to approach this. I have two sets of columns that I want to sum, row by row, but which I want to loop over many columns. If I were to do this manually, I would want:
df1[1,1]+df2[1,1]
df1[2,1]+df2[2,1]
etc... I've found many helpful examples on how to do something like:
apply(df[,c("a","d")], 1, sum)
though I want to do this over lots of columns. Also, while it's not entirely relevant, I want to phrase my question as close to my reality as possible, so my example below includes NA's, since my actual data contains many missing values.
# make a data frame, df1, with three columns
a <- sample(1:100, 50, replace = T)
b <- sample(100:300, 50, replace = T)
c <- sample(2:50, 50, replace = T)
df1 <- cbind(a,b,c)
# make another data frame, df2, with three columns
x <- sample(1:100, 50, replace = T)
y <- sample(100:300, 50, replace = T)
z <- sample(2:50, 50, replace = T)
df2 <- cbind(x,y,z)
Make it possible to randomly throw a few NAs in, function from http://www.r-bloggers.com/function-to-generate-a-random-data-set/
NAins <- NAinsert <- function(df, prop = .1){
n <- nrow(df)
m <- ncol(df)
num.to.na <- ceiling(prop*n*m)
id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
rows <- id %/% m + 1
cols <- id %% m + 1
sapply(seq(num.to.na), function(x){
df[rows[x], cols[x]] <<- NA
}
)
return(df)
}
Add the NAs to the frames
NAins(df1, .2)
NAins(df2, .14)
Then, I tried to seq along the columns in each data frame, and used apply setting the index to 1, meaning to sum each row entry. This doesn't work.
for(i in seq_along(df1)){
for(j in seq_along(df2)){
apply(c(df1[,i], col2[j]), 1, function(x) sum(x, na.rm = T))}}
Thanks for any help!
You should be able to just replace NA with 0, and then add with "+":
replace(df1, is.na(df1), 0) + replace(df2, is.na(df2), 0)
# X Y Z
# 1 7 19 6
# 2 11 12 1
# 3 16 14 11
# 4 13 7 13
# 5 10 2 11
Alternatively, if you have more than just two data.frames, you can collect them in a list and use Reduce:
Reduce("+", lapply(mget(c("df1", "df2", "df3")), function(x) replace(x, is.na(x), 0)))
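A small self-contained sketch of the Reduce approach, with hand-written data frames so the result does not depend on random sampling. The helper name zero_na and the values are made up for illustration; passing the data frames in a list is an alternative to mget when they are not separate objects in the workspace.

```r
# Two small data frames with missing values (made-up numbers)
df1 <- data.frame(X = c(1, NA, 3),  Y = c(4, 5, NA))
df2 <- data.frame(X = c(10, 20, NA), Y = c(NA, 50, 60))

# zero_na is a hypothetical helper: treat NA as 0 before adding
zero_na <- function(x) replace(x, is.na(x), 0)

# Element-wise sum across however many data frames are in the list
total <- Reduce(`+`, lapply(list(df1, df2), zero_na))
total
```

One consequence of replacing NA with 0 is that a cell missing in every input comes out as 0 rather than NA.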
Here's some sample data (and what I think is an easier way to create it):
set.seed(1) ## Set a seed so others can reproduce your sample data
dfmaker <- function() {
setNames(
data.frame(
replicate(3, sample(c(NA, 1:10), 5, TRUE), FALSE)),
c("X", "Y", "Z"))
}
df1 <- dfmaker()
df1
# X Y Z
# 1 2 9 2
# 2 4 10 1
# 3 6 7 7
# 4 9 6 4
# 5 2 NA 8
df2 <- dfmaker()
df2
# X Y Z
# 1 5 10 4
# 2 7 2 NA
# 3 10 7 4
# 4 4 1 9
# 5 8 2 3
df3 <- dfmaker()
You can transform the data.frame to an array and sum them using apply function.
install.packages('abind')
library(abind)
df <- abind(list(df1,df2), along = 3)
results <- apply(df, MARGIN = c(1,2), FUN = function(x) sum(x, na.rm = TRUE))
results

aggregate values in dataframe by partly matching rownames in R

I'm fumbling around with the following problem, but to no avail:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
value
abc 1
abcd 2
ef 3
gh 4
l <- nrow(d)
wordmat <- matrix(rep(NA, l^2), l, l, dimnames = list(row.names(d), row.names(d)))
for (i in 1:ncol(wordmat)) {
rid <- agrep(colnames(wordmat)[i], rownames(wordmat), max = 0)
d$matchid[i] <- paste(rid, collapse = ";")
}
# desired output:
(d_agg <- data.frame(value = c(3, 3, 4), row.names = c("abc;abcd", "ef", "gh")))
value
abc;abcd 3
ef 3
gh 4
is there a function for this?
Here's a possible solution that you might be able to modify to suit your needs.
Some notes:
I couldn't figure out how to deal with rownames() directly, particularly in the last stage, so this depends on you being happy with copying your row names as a new variable.
The function below "hard-codes" the variable names, functions, and so on. That is to say, it is not by any means a generalized function, but one which might be useful as you look further into this problem.
Here's the function.
matches <- function(data, ...) {
temp = vector("list", nrow(data))
for (i in 1:nrow(data)) {
temp1 = agrep(data$RowNames[i], data$RowNames, value = TRUE, ...)
temp[[i]] = data.frame(RowNames = paste(temp1, collapse = "; "),
value = sum(data[temp1, "value"]))
}
temp = do.call(rbind, temp)
temp[!duplicated(temp$RowNames), ]
}
Note that the function needs a column called RowNames, so we'll create that, and then test the function.
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
d$RowNames <- rownames(d)
matches(d)
# RowNames value
# 1 abc; abcd 3
# 3 ef 3
# 4 gh 4
matches(d, max.distance = 2)
# RowNames value
# 1 abc; abcd 3
# 3 abc; abcd; ef; gh 10
matches(d, max.distance = 4)
# RowNames value
# 1 abc; abcd; ef; gh 10
This works for your example but may need tweaking for the real thing:
d <- data.frame(value = 1:4, row.names = c("abc", "abcd", "ef", "gh"))
rowclust <- hclust(as.dist(adist(rownames(d))), method="single")
rowgroups <- cutree(rowclust, h=1.5)
rowagg <- aggregate(d, list(rowgroups), sum)
rowname <- unclass(by(rownames(d), rowgroups, paste, collapse=";"))
rownames(rowagg) <- rowname
rowagg
Group.1 value
abc;abcd 1 3
ef 2 3
gh 3 4
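To see which part may need tweaking, it can help to look at the intermediate objects from the clustering approach. The sketch below reuses the row names from the question; the variable names rn, dists and grp are chosen for this example. The cut height h is the knob: names whose edit distance is below it end up in the same group under single linkage.

```r
rn <- c("abc", "abcd", "ef", "gh")

# Pairwise edit (Levenshtein) distances between the names
dists <- adist(rn)
dists

# Single-linkage clustering, cut at height 1.5:
# only names within edit distance 1 of each other are grouped
grp <- cutree(hclust(as.dist(dists), method = "single"), h = 1.5)
grp

# Raising h above 2 would also merge "ef" and "gh" (distance 2)
grp_loose <- cutree(hclust(as.dist(dists), method = "single"), h = 2.5)
grp_loose
```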

Resources