I made a frequency table with overlapping strata, which leads to data like the following:
dat <- structure(list(`[0,25)` = 5L, `[100,250)` = 43L, `[100,500)` = 0L,
`[1000,1000000]` = 20L, `[1000,1500)` = 0L, `[1500,3000)` = 0L,
`[25,100)` = 38L, `[25,50)` = 0L, `[250,500)` = 27L, `[3000,1000000]` = 0L,
`[50,100)` = 0L, `[500,1000)` = 44L, `[500,1000000]` = 0L), row.names = "Type_A", class = "data.frame")
The intervals that form the column names of dat are in alphabetical order. I need them in numerical order, but I find this quite tricky.
Desired order:
[0,25), [25,50), [25,100), [50,100), ..., [3000,1000000]
What would be the best way to do this?
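Here is a slightly tricky but possibly useful way. It orders the intervals by the sum of their lower and upper bounds:
library(magrittr) # provides %>%
ord <- gsub("\\[|\\]|\\)", "", colnames(dat)) %>%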
strsplit(",") %>%
lapply(as.numeric) %>%
lapply(sum) %>%
unlist %>%
order()
colnames(dat)[ord]
# [1] "[0,25)" "[25,50)" "[25,100)" "[50,100)"
# [5] "[100,250)" "[100,500)" "[250,500)" "[500,1000)"
# [9] "[1000,1500)" "[1500,3000)" "[500,1000000]" "[1000,1000000]"
# [13] "[3000,1000000]"
dat[ord]
#       [0,25) [25,50) [25,100) [50,100) [100,250) [100,500) [250,500) [500,1000)
#Type_A      5       0       38        0        43         0        27         44
#       [1000,1500) [1500,3000) [500,1000000] [1000,1000000] [3000,1000000]
#Type_A           0           0             0             20              0
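Ordering by the sum of the bounds is a heuristic. If the intended order is by lower bound first and upper bound second (an assumption, since the question only spells out the first four intervals and the last one), a sketch like the following could be used. Unlike the sum-based order above, it places [500,1000000] directly after [500,1000).

```r
# Column names as they appear in the question's `dat`
nms <- c("[0,25)", "[100,250)", "[100,500)", "[1000,1000000]", "[1000,1500)",
         "[1500,3000)", "[25,100)", "[25,50)", "[250,500)", "[3000,1000000]",
         "[50,100)", "[500,1000)", "[500,1000000]")

# Strip the brackets and split each name into numeric lower/upper bounds
parts  <- strsplit(gsub("\\[|\\]|\\)", "", nms), ",")
bounds <- do.call(rbind, lapply(parts, as.numeric))

# Order by lower bound, breaking ties with the upper bound
ord <- order(bounds[, 1], bounds[, 2])
nms[ord]
```

With the data frame from the question, dat[ord] would then reorder the columns, because names(dat) is in the same order as nms.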
My input data df is:
Action          Difficulty  strings  characters  POS  NEG  NEU
Field                0.635        7          59    0    0    7
Field or Catch       0.768       28         193    0    0   28
Field or Ball       -0.591      108         713    6    0  101
Ball                -0.717       61         382    3    0   57
Catch               -0.145       89         521    1    0   88
Field                0.28       208        1214    2    3  178
Field and run        1.237       18         138    1    0   17
I am interested in group-based correlations of Difficulty with the remaining variables strings, characters, POS, NEG, NEU. The grouping variable is Action. If I am interested only in the group Field, I can do filter(str_detect(Action, 'Field')).
I can do it one by one between Difficulty and the remaining variables.
But is there a faster way to do it in one command with multiple variables?
My partial solution is:
df %>%
filter(str_detect(Action, 'Field')) %>%
na.omit %>% # Original data had multiple NA
group_by(Action) %>%
summarise_all(funs(cor))
But this results in an error.
Some relevant SO posts that I looked at:
"Find correlation coefficient of two columns in a dataframe by group" is quite relevant for generating a correlation matrix but does not address my question.
"Check the correlation of two columns in a dataframe (in R)" is useful for computing different types of correlations and introduces a different way of ignoring NAs.
Any help or guidance on this would be greatly appreciated!
For reference, this is the sample dput()
structure(list(
Action = c("Field", "Field or Catch", "Field or Ball", "Ball", "Catch", "Field", "Field and run"), Difficulty = c(0.635, 0.768, -0.591, -0.717, -0.145, 0.28, 1.237),
strings = c(7L, 28L, 108L, 61L, 89L, 208L, 18L),
characters = c(59L, 193L, 713L, 382L, 521L, 1214L, 138L),
POS = c(0L, 0L, 6L, 3L, 1L, 2L, 1L),
NEG = c(0L, 0L, 0L, 0L, 0L, 3L, 0L),
NEU = c(7L, 28L, 101L, 57L, 88L, 178L, 17L)),
class = "data.frame", row.names = c(NA,
-7L))
You may try -
library(dplyr)
library(stringr)
df %>%
filter(str_detect(Action, 'Field')) %>%
na.omit %>% # Original data had multiple NA
group_by(Action) %>%
summarise(across(-Difficulty, ~cor(.x, Difficulty)))
If you don't want to group_by Action -
df %>%
filter(str_detect(Action, 'Field')) %>%
na.omit %>%
summarise(across(-c(Difficulty, Action), ~cor(.x, Difficulty)))
# strings characters POS NEG NEU
#1 -0.557039 -0.5983826 -0.8733465 -0.1520684 -0.5899733
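For comparison, the ungrouped version can also be sketched in base R without dplyr. This is only an illustration using the sample data from the question; the helper names field, vars and cors are made up here.

```r
df <- data.frame(
  Action = c("Field", "Field or Catch", "Field or Ball", "Ball", "Catch",
             "Field", "Field and run"),
  Difficulty = c(0.635, 0.768, -0.591, -0.717, -0.145, 0.28, 1.237),
  strings = c(7L, 28L, 108L, 61L, 89L, 208L, 18L),
  characters = c(59L, 193L, 713L, 382L, 521L, 1214L, 138L),
  POS = c(0L, 0L, 6L, 3L, 1L, 2L, 1L),
  NEG = c(0L, 0L, 0L, 0L, 0L, 3L, 0L),
  NEU = c(7L, 28L, 101L, 57L, 88L, 178L, 17L)
)

# Keep the rows whose Action mentions "Field" and drop incomplete rows
field <- na.omit(df[grepl("Field", df$Action, fixed = TRUE), ])

# Every column except the grouping variable and Difficulty itself
vars <- setdiff(names(field), c("Action", "Difficulty"))

# Correlate each remaining column with Difficulty
cors <- sapply(field[vars], cor, y = field$Difficulty)
cors
```

The result is a named numeric vector with one correlation per variable, which should agree with the one-row summarise() output above.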
I have three columns: MPH, Threshold and Car. I'd like to write some code that returns, for each Car, the MPH at the first row where Threshold is 1.
MPH Threshold Car
30 0 A
31 0 A
32 1 A
33 1 A
34 1 A
35 1 A
30 0 B
31 0 B
32 0 B
33 0 B
34 1 B
35 1 B
Desired Output:
Value Car
32 A
34 B
Assuming you'll always have at least one value where Threshold = 1 for each Car, we can do
library(dplyr)
df %>%
group_by(Car) %>%
slice(which.max(Threshold == 1)) %>%
select(-Threshold)
# MPH Car
# <int> <fct>
#1 32 A
#2 34 B
Or using base R ave
df[with(df, ave(Threshold == 1, Car, FUN = function(x)
seq_along(x) == which.max(x))), ]
You can also do
library(dplyr)
df %>%
filter(Threshold == 1) %>%
subset(!duplicated(Car))
Or with data.table:
library(data.table)
dt <- data.table(df)
dt[Threshold == 1, ][!duplicated(Car),]
An option with data.table
library(data.table)
i1 <- setDT(df)[, .I[which(Threshold == 1)[1]], Car]$V1
df[i1, .(Value = MPH, Car)]
# Value Car
#1: 32 A
#2: 34 B
data
df <- structure(list(MPH = c(30L, 31L, 32L, 33L, 34L, 35L, 30L, 31L,
32L, 33L, 34L, 35L), Threshold = c(0L, 0L, 1L, 1L, 1L, 1L, 0L,
0L, 0L, 0L, 1L, 1L), Car = c("A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B")), class = "data.frame", row.names = c(NA,
-12L))
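One caveat about the which.max() variants above: which.max() returns 1 when no element is TRUE, so a Car that never reaches Threshold == 1 would silently return its first row. Filtering first avoids that. A minimal base R sketch, with a hypothetical extra car "C" added to the example data to show the difference:

```r
# Example data plus a hypothetical car "C" that never reaches Threshold == 1
df <- data.frame(
  MPH = c(30:35, 30:35, 30:32),
  Threshold = c(0, 0, 1, 1, 1, 1,  0, 0, 0, 0, 1, 1,  0, 0, 0),
  Car = rep(c("A", "B", "C"), times = c(6, 6, 3))
)

# Keep only the rows where the threshold is met, then the first such row per car
hits  <- df[df$Threshold == 1, ]
first <- hits[!duplicated(hits$Car), c("MPH", "Car")]
names(first)[1] <- "Value"
first
```

Car "C" is simply absent from the result instead of showing up with a misleading value.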
For a report I am summarizing data by group. Due to copyright issues I have created some dummy data below (the first column is the group, followed by the values):
X A B C D
1 1 12 0 12 0
2 2 24 0 15 0
3 3 56 0 48 0
4 4 89 0 96 0
5 5 13 3 65 0
6 6 11 16 0 0
7 7 25 19 0 0
8 8 24 98 0 0
9 9 18 111 0 0
10 10 173 125 0 0
11 11 10 65 0 0
I would like to create a barplot for every group (1:11) with a loop:
for(i in 1:11){x<-dummyloop[i,]
barplot(as.matrix(x), main=paste("Group", i), ylim=c(0,200))}
This works and I get a barplot for every iteration; however, they all end up in one 4 by 4 plotting window, as if I had used par(mfrow=c(4,4)).
I need individual bar plots.
So I used par(mfrow=c(1,1)), which for some reason fixed the problem (I don't use par EVER, because I am only exporting individual graphs for a scientific report); however, the "main" title is cut off at the top.
I would also like each bar to be a different color, so I used:
for(i in 1:11){x<-dummyloop[i,]
barplot(as.matrix(x), main=paste("Group", i), col=c(1:5),
ylim=c(0,200))}
Realizing that the coloring vector then only uses the first color, I tried variations of this:
for(i in 1:11){x<-dummyloop[i,]
barplot(as.matrix(x), main=paste("Group", i), col=c(4:10)[1:ncol(x)],
ylim=c(0,200))}
which doesn't do the trick...
I seem to be missing some key detail in the for loop here, thanks for help. I'm an R novice getting better every day thanks to the people here ;).
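No idea why that happens in base plot. Here is an alternative way with ggplot2.
library(tidyr) # gather()
library(ggplot2)
for(i in 1:11){x <- gather(data[i,]) # `data` is the data frame from the question, in long format per group
print(ggplot(data = x, aes(x = key, y = value, fill = key)) + # fill by key so each bar gets its own color
geom_bar(stat = "identity", show.legend = FALSE) +
ggtitle(paste("Group", i)) + theme(plot.title = element_text(hjust = 0.5)) +
ylim(0,200))
}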
So is your main title still cut off?
Then extend the margin at the top of the plot. Before plotting, execute:
par(mar = c(2, 2, 3 , 2)) # c(bottom, left, top, right)
You can reset your settings with dev.off() when experimenting.
Staying in base R, you could simply use by and set col according to the group.
colors <- rainbow(length(unique(dat$X))) # define colors, 11 in your case
by(dat, dat$X, function(x)
barplot(as.matrix(x), main=paste("Group", x$X), ylim=c(0, 200), col=colors[x$X]))
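On the coloring question: as far as I can tell, only the first color shows up because as.matrix(x) is a one-row matrix. barplot() treats each column of a matrix as a stacked bar and applies col to the segments within a stack (the rows), so a one-row matrix only ever uses the first color. Passing a plain vector instead gives one color per bar. A minimal sketch, using a hypothetical two-group excerpt of the data, with pdf(NULL) only there so that no window or file is opened:

```r
# Hypothetical two-group excerpt of the dummy data
dat <- data.frame(X = 1:2, A = c(12L, 24L), B = c(0L, 0L),
                  C = c(12L, 15L), D = c(0L, 0L))

pdf(NULL)                       # null device: nothing is written to disk
par(mar = c(2, 2, 3, 2))        # extra top margin so the title is not cut off

mids <- list()
for (i in seq_len(nrow(dat))) {
  heights <- unlist(dat[i, ])   # named vector with entries X, A, B, C, D
  # with a vector as height, `col` is applied bar by bar
  mids[[i]] <- barplot(heights, main = paste("Group", i),
                       col = seq_along(heights), ylim = c(0, 200))
}
invisible(dev.off())
```

Each call draws five bars in five different colors. Dropping the X column first (for example dat[i, -1]) would leave only the value bars.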
Data
dat <- structure(list(X = 1:11, A = c(12L, 24L, 56L, 89L, 13L, 11L,
25L, 24L, 18L, 173L, 10L), B = c(0L, 0L, 0L, 0L, 3L, 16L, 19L,
98L, 111L, 125L, 65L), C = c(12L, 15L, 48L, 96L, 65L, 0L, 0L,
0L, 0L, 0L, 0L), D = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11"))
I am a novice in R. I have a tab-separated text file with sales data for each day. The format is product-id, day0, day1, day2, day3 and so on. Part of the input file is given below:
productid 0 1 2 3 4 5 6
1 53 40 37 45 69 105 62
4 0 0 2 4 0 8 0
5 57 133 60 126 90 87 107
6 108 130 143 92 88 101 66
10 0 0 2 0 4 0 36
11 17 22 16 15 45 32 36
I used the code below to read the file:
pdInfo <- read.csv("products.txt",header = TRUE, sep="\t")
This reads the entire file, and pdInfo is a data frame. I would like to convert this data frame to a time series object for further processing. When I run a stationarity test, the augmented Dickey–Fuller (ADF) test, it shows an error. I tried the code below:
x <- ts(data.matrix(pdInfo),frequency = 1)
adf <- adf.test(x)
error: Error in adf.test(x) : x is not a vector or univariate time series
Thanks in advance for the suggestions
In R, time series are usually in the form "one row per date", whereas your data is in the form "one column per date". You probably need to transpose the data before you convert it to a ts object.
First transpose it:
y= t(pdInfo)
Then make the top row (the product ids) into the column names
colnames(y) = y[1,]
y= y[-1,] # to drop the first row
This should work:
x = ts(y, frequency = 1)
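Put together, the steps above might look like the sketch below. It builds a small data frame by hand instead of reading products.txt (the column names X0 to X2 are an assumption about what read.csv produces for numeric headers), and it shows that a single column of the resulting ts object is univariate, which is the shape the error message asks for.

```r
# Hand-made stand-in for the first three products and days of the file
pdInfo <- data.frame(productid = c(1L, 4L, 5L),
                     X0 = c(53L, 0L, 57L),
                     X1 = c(40L, 0L, 133L),
                     X2 = c(37L, 2L, 60L))

y <- t(pdInfo)                      # days in rows, products in columns
colnames(y) <- y["productid", ]     # label the columns by product id
y <- y[-1, , drop = FALSE]          # drop the product id row

x <- ts(y, frequency = 1)           # multivariate series, one column per product
one_product <- x[, "5"]             # univariate series for product 5
one_product
```

Because x as a whole is still multivariate, a test that expects a univariate series would need to be applied to one column at a time, such as one_product here.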
library(purrr)
library(dplyr)
library(tidyr)
library(tseries)
# create the data
df <- structure(list(productid = c(1L, 4L, 5L, 6L, 10L, 11L),
X0 = c(53L, 0L, 57L, 108L, 0L, 17L),
X1 = c(40L, 0L, 133L, 130L, 0L, 22L),
X2 = c(37L, 2L, 60L, 143L, 2L, 16L),
X3 = c(45L, 4L, 126L, 92L, 0L, 15L),
X4 = c(69L, 0L, 90L, 88L, 4L, 45L),
X5 = c(105L, 8L, 87L, 101L, 0L, 32L),
X6 = c(62L, 0L, 107L, 66L, 36L, 36L)),
.Names = c("productid", "0", "1", "2", "3", "4", "5", "6"),
class = "data.frame", row.names = c(NA, -6L))
# apply adf.test to each productid and return p.value
adfTest <- df %>% gather(key = day, value = sales, -productid) %>%
arrange(productid, day) %>%
group_by(productid) %>%
nest() %>%
mutate(adf = data %>% map(., ~adf.test(as.ts(.$sales)))
,adf.p.value = adf %>% map_dbl(., "p.value")) %>%
select(productid, adf.p.value)
My issue is that within a loop, for every i, a matrix like this is output:
structure(c(8L, 4L, 3L, 4L, 1L, 8L, 28L, 32L, 24L, 32L, 8L, 64L,
0L, 6L, 12L, 16L, 4L, 32L, 0L, 0L, 3L, 12L, 3L, 24L, 0L, 0L,
0L, 6L, 4L, 32L, 0L, 0L, 0L, 0L, 0L, 8L, 0L, 0L, 0L, 0L, 0L,
28L), .Dim = 6:7, .Dimnames = structure(list(c("ESN", "GWD",
"LWK", "MSL", "PEL", "YRI"), c("ACB", "ESN", "GWD", "LWK", "MSL",
"PEL", "YRI")), .Names = c("", "")), class = "table")
This matrix counts pairwise sharing. These counts should now be added to a larger table that has more levels than only the 7 present in this table. It is always a symmetric matrix (so the upper triangle can be neglected).
The real table (in which all elements are 0 at the beginning):
matr<-matrix(0,nrow=26,ncol=26)
pop<-c("CHB","JPT","CHS","CDX","KHV","CEU","TSI","FIN","GBR","IBS","YRI","LWK","GWD","MSL","ESN","ASW","ACB","MXL","PUR","CLM","PEL","GIH","PJL","BEB","STU","ITU")
rownames(matr)<-pop
colnames(matr)<-pop
Can somebody tell me how I can add these counts from the small table to the large table (in the correct field) in an efficient way? I need to update the table 100k times, so efficiency would be good. As mentioned, adding in the lower triangle is fine.
EDIT
Another data set might look like this (it would be generated by the next iteration of the loop):
structure(c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), .Dim = c(3L,
3L), .Dimnames = structure(list(c("IBS", "MXL", "TSI"), c("GBR",
"IBS", "MXL")), .Names = c("", "")), class = "table")
This should then also be added to matr. If a field already has a number in it, the two numbers should be added up.
Thanks
Taking into account duplicate/non-equal/non-zero entries in each "table" created through the iterations, and updating only the lower.tri of "matr":
library(Matrix) # for sparseMatrix()
for(tab in tabs) {
## if each 'tab' is large enough, a 'rep(, each = )' could be used
## instead of creating (and subsetting with) 'row(tab)' and 'col(tab)'
i = match(rownames(tab), rownames(mat))[row(tab)]
j = match(colnames(tab), colnames(mat))[col(tab)]
## to fill only the 'lower.tri'
ii = pmax(i, j); jj = pmin(i, j)
## sum duplicate entries 'tab' with 'sparseMatrix's intrinsic 'xtabs'-like behaviour
ijx = summary(sparseMatrix(ii, jj, x = c(tab)))
## subset and assign with a matrix index updating previous entries
ij = cbind(ijx$i, ijx$j)
mat[ij] = mat[ij] + ijx$x
}
mat
# a b c d e
#a 0 0 0 0 0
#b 4 1 0 0 0
#c 6 7 2 0 0
#d 5 12 5 7 0
#e 4 6 3 3 0
where "tabs" is a "list" containing the -iteratively- created "table"s:
set.seed(007)
tabs = replicate(3, table(replicate(2,
sample(letters[1:5], 50, TRUE), simplify = FALSE))[
sample(5, sample(2:5, 1)), sample(5, sample(2:5, 1))],
simplify = FALSE)
and "mat" is a smaller "matr":
mat = matrix(0L, 5, 5, dimnames = replicate(2, letters[1:5], simplify = FALSE))
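If the Matrix dependency is not wanted, a simpler variant is to index the large matrix by dimnames and fold everything into the lower triangle once at the end, rather than on every iteration. This is only a sketch under the assumption that a small table never lists the same population twice in its rows or columns; add_counts and fold_lower are hypothetical helper names, and the second table from the question is used as input.

```r
pop <- c("CHB", "JPT", "CHS", "CDX", "KHV", "CEU", "TSI", "FIN", "GBR", "IBS",
         "YRI", "LWK", "GWD", "MSL", "ESN", "ASW", "ACB", "MXL", "PUR", "CLM",
         "PEL", "GIH", "PJL", "BEB", "STU", "ITU")
matr <- matrix(0, 26, 26, dimnames = list(pop, pop))

tab <- structure(c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), .Dim = c(3L, 3L),
                 .Dimnames = list(c("IBS", "MXL", "TSI"), c("GBR", "IBS", "MXL")),
                 class = "table")

# Add a small table to the matching block of the large matrix, matched by name
add_counts <- function(big, small) {
  r  <- rownames(small)
  cl <- colnames(small)
  big[r, cl] <- big[r, cl] + unclass(small)
  big
}

# Move everything into the lower triangle (done once, after all updates)
fold_lower <- function(m) {
  folded <- m + t(m)
  diag(folded) <- diag(m)          # the diagonal must not be doubled
  folded[upper.tri(folded)] <- 0
  folded
}

matr <- add_counts(matr, tab)      # first iteration
matr <- add_counts(matr, tab)      # a second iteration adds on top
low  <- fold_lower(matr)
low["GBR", "TSI"]
```

Inside the loop only add_counts() is needed, so each of the 100k updates touches just the small block. Whether this is faster than the sparseMatrix approach above would have to be measured.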