I have a matrix (called matrix) with two-level groupings, as illustrated in the row and column names:
UKC1_SS1 UKC1_SS2 UKC2_SS1 UKC2_SS2
UKC1_SS1 1 2 3 4
UKC1_SS2 5 6 7 8
UKC2_SS1 9 10 11 12
UKC2_SS2 13 14 15 16
I want to create a table with the row and column sums grouped by the first four characters of the row and column names:
UKC1 UKC2
UKC1 14 22
UKC2 46 54
I tried calculating the row and column sums sequentially with rowsum() and colSums(),
sum.matrix <- rowsum(matrix, substr(rownames(matrix), start = 1, stop = 4))
sum.matrix <- colSums(sum.matrix, substr(colnames(test), start = 1, stop = 4))
but I receive the following error message:
Error in colSums(test, substr(colnames(test), start = 1, stop = 4)) :
invalid 'na.rm' argument
When I run sum(is.na(matrix)) I confirm that there are NA values in matrix.
We can do the sum with xtabs after changing the dimnames to the substr of the first 4 characters:
dimnames(m1) <- lapply(dimnames(m1), substr, 1, 4)
xtabs(Freq~ Var1 + Var2, as.data.frame.table(m1))
# Var2
#Var1 UKC1 UKC2
# UKC1 14 22
# UKC2 46 54
data
m1 <- structure(c(1L, 5L, 9L, 13L, 2L, 6L, 10L, 14L, 3L, 7L, 11L, 15L,
4L, 8L, 12L, 16L), .Dim = c(4L, 4L), .Dimnames = list(c("UKC1_SS1",
"UKC1_SS2", "UKC2_SS1", "UKC2_SS2"), c("UKC1_SS1", "UKC1_SS2",
"UKC2_SS1", "UKC2_SS1.1")))
I am given a dataframe with 10 students, each one having a score for 4 different tests. I must select the 3 best scores for each student and compute their average.
noma interro1 interro2 interro3 interro4
1 836016120449 6 3 NA 3
2 596844884419 1 4 2 8
3 803259953398 2 2 9 1
4 658786759629 3 1 3 2
5 571155022756 4 9 1 4
6 576037886365 8 7 8 7
7 045086625199 9 6 7 6
8 621909979467 5 8 4 5
9 457029205538 7 5 6 9
10 402526220817 NA 10 5 10
This dataframe provides the scores for 4 tests for 10 students.
Write a function that calculates the average score for the 3 best tests.
Calculate this average score for the 10 students.
average <- function(t){
x <- sort(t, decreasing = TRUE)[1:3]
return(mean(x, na.rm=TRUE))
}
apply(interro2, 1, average)
Considering I want the 3 best, I thought that sort() could be useful here; however, what I receive is
In mean.default(x, na.rm = TRUE) :
argument is not numeric or logical: returning NA
I tried this one too:
average <- function(t){
rowMeans(sort(t, decreasing = TRUE, na.rm=TRUE)[1:3])
}
UPDATE: answered. The problem was the dimensions of the dataframe in the apply line: I had to drop the first column, which contains the student IDs (with it included, apply() coerces the whole dataframe to a character matrix, which is presumably why mean() complained that its argument is not numeric). So the version below works:
average <- function(t){
x <- sort(t, decreasing = TRUE)[1:3]
return(mean(x, na.rm=TRUE))
}
apply(interro2[-1], 1, average)
Try pivoting the scores to long format, then sort the scores by name and keep the top 3 scores. Finally, take the average grouped by name:
library(dplyr)
library(tidyr)
data <- data.frame(
stringsAsFactors = FALSE,
noma = c("836016120449","596844884419",
"803259953398","658786759629","571155022756",
"576037886365","045086625199","621909979467","457029205538",
"402526220817"),
interro1 = c(6L, 1L, 2L, 3L, 4L, 8L, 9L, 5L, 7L, NA),
interro2 = c(3L, 4L, 2L, 1L, 9L, 7L, 6L, 8L, 5L, 10L),
interro3 = c(NA, 2L, 9L, 3L, 1L, 8L, 7L, 4L, 6L, 5L),
interro4 = c(3L, 8L, 1L, 2L, 4L, 7L, 6L, 5L, 9L, 10L)
)
data <- data %>% pivot_longer(!noma, names_to = "interro", values_to = "value") %>% replace_na(list(value=0))
data_new1 <- data[order(data$noma, data$value, decreasing = TRUE), ] # Order data descending
data_new1 <- Reduce(rbind, by(data_new1, data_new1["noma"], head, n = 3)) # Top N highest values by group
data_new1 <- data_new1 %>% group_by(noma) %>% summarise(Value_mean = mean(value))
I have a table with two columns A and B. I want to create a new table with two new columns added: X and Y.
The X column should contain the result of dividing column A in consecutive pairs of rows: the value from the first row divided by the value from the second row, the third row divided by the fourth row, and so on.
The Y column should contain the same pairwise division applied to column B.
So far I have used Excel for this, but now I need it in R, ideally as a function so that I can reuse the code easily. I haven't done this in R yet, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample results:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c("",
"0,285714286", "", "0,454545455", "", "4,5", "", "2,428571429",
"", "1,666666667"), Y = c("", "0,6", "", "0,047619048", "", "2,09375",
"", "0,791666667", "", "1,189189189")), class = "data.frame", row.names = c(NA,
-10L))
You could use dplyr's across and lag (combined with modulo for picking every second row):
library(dplyr)
df |> mutate(across(c(A, B), ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
If you want a character vector change NA to "".
Output:
A B new_A new_B
1 2 3 NA NA
2 7 5 0.2857143 0.60000000
3 5 1 NA NA
4 11 21 0.4545455 0.04761905
5 54 67 NA NA
6 12 32 4.5000000 2.09375000
7 34 19 NA NA
8 14 24 2.4285714 0.79166667
9 10 44 NA NA
10 6 37 1.6666667 1.18918919
Function:
ab_fun <- function(data, vars) {
  data |>
    # use the columns passed in `vars` rather than hardcoding A and B
    mutate(across({{ vars }}, ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
}
ab_fun(df, c(A,B))
Updated with new data and correct code. + Function
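If you prefer to avoid dplyr, here is a rough base-R sketch of the same pairwise division; divide_pairs and its cols argument are names introduced here purely for illustration, and it assumes rows are processed in consecutive pairs as in the question:
divide_pairs <- function(data, cols = c("A", "B")) {
  even <- seq_len(nrow(data)) %% 2 == 0
  for (col in cols) {
    new <- rep(NA_real_, nrow(data))
    # divide each odd row's value by the value in the following (even) row
    new[even] <- data[[col]][which(even) - 1] / data[[col]][which(even)]
    data[[paste0("new_", col)]] <- new
  }
  data
}
divide_pairs(df)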
The following code returns the average of delta1, conditional on its values being greater than 6.
mean(df[df$delta1>6, "delta1"], na.rm=T)
Now, how do I apply this to every column in the dataframe?
df:
delta1 delta2 delta3
NA 2 3
4 NA 6
7 8 NA
10 NA 12
NA 14 15
16 NA 18
19 20 NA
The apply-family of functions is useful here:
sapply(df, function(x) mean(x[x>6], na.rm=T))
We can set the values in the dataframe which are less than or equal to 6 to NA and compute the column means using colMeans, ignoring the NA values.
df[df <= 6] <- NA
colMeans(df, na.rm = TRUE)
#delta1 delta2 delta3
# 13 14 15
data
df <- structure(list(delta1 = c(NA, 4L, 7L, 10L, NA, 16L, 19L), delta2 = c(2L,
NA, 8L, NA, 14L, NA, 20L), delta3 = c(3L, 6L, NA, 12L, 15L, 18L,
NA)), class = "data.frame", row.names = c(NA, -7L))
Let's say I have this dataframe:
ID X1 X2
1 1 2
2 2 1
3 3 1
4 4 1
5 5 5
6 6 20
7 7 20
8 9 20
9 10 20
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
And I want to select rows in which the absolute value of the difference between columns X1 and X2 is greater than or equal to 2.
For example, row 4 value is 4-1, which is 3 and should be selected.
Row 9 value is 10-20, which is -10. Absolute value is 10 and should be selected.
In this case it would be rows 3, 4, 6, 7, 8 and 9
I tried:
dataset2 = dataset[,abs(dataset- c(dataset[,2])) > 2]
But I get an error.
The operation:
abs(dataset- c(dataset[,2])) > 2
does give me entries whose difference is more than 2, but it only compares everything against my second column and does not select the rows properly.
We can get the difference between the 'X1' and 'X2' columns and create a logical expression in subset to subset the rows:
subset(dataset, abs(X1 - X2) >= 2)
# ID X1 X2
#3 3 3 1
#4 4 4 1
#6 6 6 20
#7 7 7 20
#8 8 9 20
#9 9 10 20
Or using index
subset(dataset, abs(dataset[[2]] - dataset[[3]]) >= 2)
data
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
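The same filter can also be written with dplyr, if that is already in use (a sketch assuming the dplyr package):
library(dplyr)
dataset %>% filter(abs(X1 - X2) >= 2)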
How can I select all of the rows for a random sample of column values?
I have a dataframe that looks like this:
tag weight
R007 10
R007 11
R007 9
J102 11
J102 9
J102 13
J102 10
M942 3
M054 9
M054 12
V671 12
V671 13
V671 9
V671 12
Z990 10
Z990 11
That you can replicate using...
weights_df <- structure(list(tag = structure(c(4L, 4L, 4L, 1L, 1L, 1L, 1L,
3L, 2L, 2L, 5L, 5L, 5L, 5L, 6L, 6L), .Label = c("J102", "M054",
"M942", "R007", "V671", "Z990"), class = "factor"), value = c(10L,
11L, 9L, 11L, 9L, 13L, 10L, 3L, 9L, 12L, 12L, 14L, 5L, 12L, 11L,
15L)), .Names = c("tag", "value"), class = "data.frame", row.names = c(NA,
-16L))
I need to create a dataframe containing all of the rows from the above dataframe for two randomly sampled tags. Let's say tags R007 and M942 get selected at random; my new dataframe needs to look like this:
tag weight
R007 10
R007 11
R007 9
M942 3
How do I do this?
I know I can create a list of two random tags like this:
library(plyr)
tags <- ddply(weights_df, .(tag), summarise, count = length(tag))
set.seed(5464)
tag_sample <- tags[sample(nrow(tags),2),]
tag_sample
Resulting in...
tag count
4 R007 3
3 M942 1
But I just don't know how to use that to subset my original dataframe.
Is this what you want?
subset(weights_df, tag %in% sample(levels(tag), 2))
If your data.frame is named dfrm, then this will select 100 random tags
dfrm[ sample(NROW(dfrm), 100), "tag" ] # possibly with repeats
If, on the other hand, you want a dataframe with the same columns (possibly with repeats):
samp <- dfrm[ sample(NROW(dfrm), 100), ] # leave the col name entry blank to get all
A third possibility: you want 100 distinct tags at random, with the probability not weighted by frequency at all:
samp.tags <- unique(dfrm$tag)[ sample(length(unique(dfrm$tag)), 100) ]
Edit: With the revised question, one of these:
subset(dfrm, tag %in% c("R007", "M942") )
Or:
dfrm[dfrm$tag %in% c("R007", "M942"), ]
Or:
dfrm[grep("R007|M942", dfrm$tag), ]