Sorting a dataframe by multiple columns - r

I wanted to sort a data frame while the team names in the "name" column stay with the ratings in the "ratings" column. For example, ne has the highest rating of 13.62. I need both "ne" and "13.62" to be sorted to the first position. Here is some of my code:
x <-t(nfl_data)
y <- solve(x)
myfun = function(i) round( (1/13)*(sum(x[,i])) + mean(y[,i]), digits=2 )
ratings = numeric(32)
for (i in 1:32){
ratings[i] = myfun(i)
}
teams <- c('ari','atl','bal','buf','car','chi','cin','cle','dal',
'den','det','grn','hou','ind','jac','kc','mia','min','ne',
'no','nyj','nyg','oak','phi','pit','sd','sea','sf','stl',
'tb','tn','was')
df <- data.frame(teams,ratings)
df[with(df, order(teams, -ratings)), ]
Here is the sample output of df:
teams ratings
1 ari -3.73
2 atl 9.46
3 bal 2.31
4 buf -5.46
5 car -0.69
6 chi 7.57
7 cin 6.69
8 cle -4.23
I get the same results if I try running the ordered data frame code. What am I doing wrong?

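Sort on the ratings column only. With order(teams, -ratings) the rows are ordered alphabetically by team first, and because every team name is unique, -ratings is never used as a tie-breaker. Note that the expression returns a sorted copy; df itself is unchanged unless you assign the result.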
df[with(df, order(-ratings)), ]
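As a small sketch of the difference, here is a four-team subset of the ratings from the question (the subset is only for illustration, and the names by_team and by_rating are made up here):

```r
# Four teams taken from the question's data, for illustration only
df <- data.frame(
  teams   = c("ari", "atl", "bal", "ne"),
  ratings = c(-3.73, 9.46, 2.31, 13.62)
)

# Ordering by teams first leaves the rows in alphabetical order:
# the unique team names never tie, so -ratings has no effect
by_team <- df[order(df$teams, -df$ratings), ]

# Ordering by descending rating alone moves "ne" to the top
by_rating <- df[order(-df$ratings), ]
by_rating
```

Each team name stays on the same row as its rating because whole rows are reordered, not individual columns. order(df$ratings, decreasing = TRUE) is an equivalent spelling.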

Related

Dividing a number in a column dfA with a number in a row dfB, based on the column name and the row name in R?

I want to do something similar to index match match in Excel, but depending on the column name in dfA and the row name in dfB.
A subset example: dfA (my data) imported from excel and dfB is always the same (molar weight):
dfA <- data.frame(Name=c("A", "B", "C", "D"), #usually imported df from Excel
Asp=c(2,4,6,8),
Glu=c(1,2,3,4),
Arg=c(5,6,7,8))
> dfA
Name Asp Glu Arg
1 A 2 1 5
2 B 4 2 6
3 C 6 3 7
4 D 8 4 8
X <- c("Arg","Asp","Glu")
Y <- c(174,133,147)
dfB <- data.frame(X,Y)
> dfB
X Y
1 Arg 174
2 Asp 133
3 Glu 147
I would like to divide the matching numbers from dfA with dfB, meaning R would "look up" and "take" the value from dfA and divide it with the value that "matches" in dfB.
So for example take the value from sample named A under column "Arg" = 5, and divide it by the row "Arg" in dfB = 174
5 / 174 = 0.029 and make a new data frame called dfC. Looking like below:
#How R would calculate:
Name Asp Glu Arg
1 A 2/133 1/147 5/174
2 B 4/133 2/147 6/174
3 C 6/133 3/147 7/174
4 D 8/133 4/147 8/174
>dfC
Name Asp Glu Arg
1 A 0.015 0.007 0.029
2 B 0.030 0.014 0.034
3 C 0.045 0.020 0.040
4 D 0.060 0.027 0.046
I hope it makes sense :) I am really stuck and have no clear idea how I can do this easily. I can only think of some weird workarounds that take much longer than Excel. But I would like to standardize it, so I can use the R script every time I get data from the lab.
Here is a way. Match the names of dfA, excluding the first, against column dfB$X. Then divide each column of dfA[-1] by the corresponding value of dfB$Y. Finally, bind the result with the Name column of dfA.
i <- match(names(dfA)[-1], dfB$X)
tmp <- mapply(\(x, y) x/y, dfA[-1], dfB$Y[i])
cbind(dfA[1], tmp)
#> Name Asp Glu Arg
#> 1 A 0.01503759 0.006802721 0.02873563
#> 2 B 0.03007519 0.013605442 0.03448276
#> 3 C 0.04511278 0.020408163 0.04022989
#> 4 D 0.06015038 0.027210884 0.04597701
Created on 2022-09-12 with reprex v2.0.2
Simpler, note the backticks:
tmp <- mapply(`/`, dfA[-1], dfB$Y[i])
Even simpler, do not create the temp matrix.
cbind(dfA[1], mapply(`/`, dfA[-1], dfB$Y[i]))
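If the result should stay a data frame throughout, one variation (a sketch, not the answerer's code) is to build a named lookup vector from dfB and assign the divided columns back with Map. The names mw, cols and dfC are chosen here for illustration.

```r
dfA <- data.frame(Name = c("A", "B", "C", "D"),
                  Asp = c(2, 4, 6, 8),
                  Glu = c(1, 2, 3, 4),
                  Arg = c(5, 6, 7, 8))
dfB <- data.frame(X = c("Arg", "Asp", "Glu"), Y = c(174, 133, 147))

# Named lookup vector: molar weight indexed by amino acid name
mw <- setNames(dfB$Y, dfB$X)

# Every column except Name is treated as a measurement column
cols <- setdiff(names(dfA), "Name")
stopifnot(all(cols %in% names(mw)))  # fail early if a weight is missing

# Divide each column by the weight that has the same name
dfC <- dfA
dfC[cols] <- Map(`/`, dfA[cols], mw[cols])
round(dfC[cols], 3)
```

Indexing mw by name means the column order in dfA and the row order in dfB do not have to agree.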

Conditionally add values to a new column and replace values in the conditioning column in R

I am working on a project where I need to read files into my environment and afterwards based on the row's name change a value and add new values to new columns: i.e.
X1 Area Mean Min Max file_row_name
55 0.165 31.384 4 82 ./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv
56 0.097 45.867 4 121 ./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv
168 0.042 28.252 20 49 ./Fluorescence Analysis/T0-25-50-10x-1-1.csv
So in the example I want to look at each row's file_row_name and if the rows have the same name, create two variables: Conc & Rep and replace the values at file_row_name so as to look like this:
X1 Area Mean Min Max file_row_name Conc Rep
55 0.165 31.384 4 82 T0 12.5 3
56 0.097 45.867 4 121 T0 12.5 3
168 0.042 28.252 20 49 T0 25 1
So far what I've done is:
my_df$Conc[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"] <- 12.5
my_df$Rep[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"] <- 3
my_df$file_row_name[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv"] <- "T0"
my_df$Conc[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3.csv"] <- 12.5
my_df$Rep[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3.csv"] <- 3
my_df$file_row_name[my_df$file_row_name == "./Fluorescence Analysis/T0-12.5-150-10x-3.csv"] <- "T0"
But this takes too long and when I try an if clause:
if(my_df$file_row_name %in% c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
"./Fluorescence Analysis/T0-12.5-150-10x-3.csv")){
my_df$Conc = "12.5"
my_df$Rep = 3
my_df$file_row_name = "T0"
}
it tells me that:
Warning message:
In if (my_df$file_row_name %in% c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv", :
the condition has length > 1 and only the first element will be used
And if I manage to bypass that warning message with another code piece, basically the columns file_row_name Conc and Rep get replaced with the same value and nothing is changed based on condition.
Instead of if (which is not vectorized), create a logical row index and use it to assign the new values:
i1 <- my_df$file_row_name %in% c("./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
"./Fluorescence Analysis/T0-12.5-150-10x-3.csv")
my_df[i1, c("Conc", "Rep", "file_row_name")] <- list("12.5", 3, "T0")
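When there are many distinct file names, listing each one by hand does not scale. A possible alternative is to parse the values out of the name itself. The sketch below assumes, based only on the two examples in the question, that the dash-separated fields are label, concentration, two unused fields, then replicate; that layout is an inference, not something the question states.

```r
my_df <- data.frame(
  X1 = c(55, 56, 168),
  file_row_name = c(
    "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
    "./Fluorescence Analysis/T0-12.5-150-10x-3-1.csv",
    "./Fluorescence Analysis/T0-25-50-10x-1-1.csv"
  )
)

# Drop the directory and the .csv extension, then split on "-"
stem  <- sub("\\.csv$", "", basename(my_df$file_row_name))
parts <- strsplit(stem, "-", fixed = TRUE)

# Helper (hypothetical name) that pulls the k-th field from every row
field <- function(k) vapply(parts, function(p) p[k], character(1))

my_df$Conc <- as.numeric(field(2))    # assumed position of the concentration
my_df$Rep  <- as.integer(field(5))    # assumed position of the replicate
my_df$file_row_name <- field(1)       # e.g. "T0"
my_df
```

Because every step is vectorized, all rows are handled at once and no if clause is needed.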

Extract multiple values from a dataset by subsetting with a vector

I have a data frame called "Navi", with 72 rows that describe all possible combinations of three variables f,g and h.
head(Navi)
f g h
1 40.00000 80 0.05
2 57.14286 80 0.05
3 74.28571 80 0.05
4 91.42857 80 0.05
5 108.57143 80 0.05
6 125.71429 80 0.05
I have a dataset that also contains these 3 variables f,g and h along with several others.
head(dataset1[,7:14])
# A tibble: 6 x 8
h f g L1 L2 Ref1 Ref2 FR
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.02 20 100 53 53 0.501 2.00 2
2 0.02 20 260 67 67 0.200 5.01 5.2
3 0.02 20 420 72 71 0.128 7.83 8.4
4 0.02 20 580 72 72 0.0956 10.5 11.6
5 0.02 20 740 73 73 0.0773 12.9 14.8
6 0.02 20 900 72 71 0.0655 15.3 18
What I'm trying to do is:
for each row in the combinations data frame, filter the dataset by the three variables f,g and h.
Then, if there are exact matches, give me the matching rows of this dataset, then extract the values in the columns "L1" and "FR" in this dataset and calculate the average of them. Save the average value in the vectors "L_M2" and "FR_M2"
If there aren't exact matches, give me the rows where f,g,h in the dataset are closest to f,g,h from the data frame. Then extract all values for L and FR in these rows, and calculate the average. Save the average value in the vectors "L_M2" and "FR_M2".
What I've already tried:
I created two empty vectors where the extracted values shall be saved later on.
Then I am looping over every row of the combinations data frame, filtering the dataset by f,g and h.
The result would be multiple rows, where the values for f,g and h are the same in the dataset as in the row of the combinations data frame.
L_M2 <- vector()
FR_M2 <- vector()
for (i in 1:(nrow(Navi))){
matchingRows[i] <- dataset1[dataset1$P == "input$varP"
& dataset1$Las == input$varLas
& dataset1$Opt == input$varO
& dataset1$f == Navi[i,1]
& dataset1$g == Navi[i,2]
& dataset1$h == Navi[i,3]]
}
The thing is, I don't know what to do from here on. I don't know how to check for rows with closest values by multiple variables, if there are no exact matches...
I only did something more or less similar in the past, but I only checked for the closest "g" value like this:
L_M2 <- vector()
FR_M2 <- vector()
for (i in 1:(nrow(Navi))){
matchingRows[i] <- dataset1[dataset1$P == "input$varP"
& dataset1$Las == input$varLas
& dataset1$Opt == input$varO
& dataset1$f == Navi[i,1]
& dataset1$g == Navi[i,2]
& dataset1$h == Navi[i,3]]
for (i in 1:(nrow(Navi)){
Differences <- abs(Navi[i,2]- matchingRows$G)
indexofMin <- which(Differences == min (Differences))
L_M2 <- append(L_M2, matchingRows$L[[indexofMin]], after = length(L_M2))
FR_M2 <- append(FR_M2, matchingRows$FR[[indexofMin]], after = length(FR_M2))
}
So can anybody tell me how to achieve this extraction process? I am still pretty new to R, so please tell me if I made a rookie mistake or forgot to include some crucial information. Thank you!
First convert your data into dataframe (if not done before).
Navi <- data.frame(Navi)
Savi <- data.frame(dataset1[,7:14])
Then use merge to filter your lines:
df1 <- merge(Navi, Savi, by = c("f","g","h"))
Save "L1" and "FR" average from df1:
Average1 <- ((df1$L1+df1$FR)/2)
Get the new Navi data frame containing the rows that do not have an exact match on the f, g, h columns:
Navi_new <- Navi[!tail(duplicated(rbind(df1[names(Navi)], Navi)), nrow(Navi)), ]
For comparing the values with the nearest match, one variable at a time:
A1 <- vapply(Navi_new$f, function(x) x - Savi$f, numeric(nrow(Savi)))
A2 <- apply(abs(A1), 2, which.min)
B1 <- vapply(Navi_new$g, function(x) x - Savi$g, numeric(nrow(Savi)))
B2 <- apply(abs(B1), 2, which.min)
C1 <- vapply(Navi_new$h, function(x) x - Savi$h, numeric(nrow(Savi)))
C2 <- apply(abs(C1), 2, which.min)
A2, B2 and C2 hold, for each row of Navi_new, the index of the Savi row closest in f, g and h respectively. You can use these indices to subset Savi and get the average of "L1" and "FR" as in the earlier step.
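The question asks for rows that are closest in f, g and h jointly. One way to sketch that is a single distance over all three variables, where an exact match is simply the case of distance zero. The scaling by each variable's range and the Euclidean distance are assumptions made here, not something from the question, and the toy data stand in for Navi and dataset1.

```r
# Toy stand-ins for Navi and dataset1 (hypothetical values)
Navi <- data.frame(f = c(20, 40), g = c(100, 250), h = c(0.02, 0.05))
dataset1 <- data.frame(
  f  = c(20, 20, 20, 40),
  g  = c(100, 100, 260, 260),
  h  = c(0.02, 0.02, 0.02, 0.05),
  L1 = c(53, 55, 67, 72),
  FR = c(2, 4, 5.2, 8.4)
)
vars <- c("f", "g", "h")

# Assumption: scale each variable by its range so no unit dominates
rng <- vapply(dataset1[vars], function(v) diff(range(v)), numeric(1))
rng[rng == 0] <- 1
scaled <- sweep(as.matrix(dataset1[vars]), 2, rng, "/")

# For one Navi row: average L1 and FR over all rows at minimum distance
nearest_mean <- function(i) {
  target <- unlist(Navi[i, vars]) / rng
  d <- sqrt(rowSums(sweep(scaled, 2, target, "-")^2))
  hit <- which(d - min(d) < 1e-12)  # keeps ties, e.g. several exact matches
  c(L = mean(dataset1$L1[hit]), FR = mean(dataset1$FR[hit]))
}

res   <- t(vapply(seq_len(nrow(Navi)), nearest_mean, c(L = 0, FR = 0)))
L_M2  <- res[, "L"]
FR_M2 <- res[, "FR"]
```

In the toy data the first Navi row has two exact matches, which are averaged, and the second has none, so its single nearest row is used. Any additional filters such as P, Las or Opt would be applied to dataset1 before this step.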

R Function For Loop Data Frame

I apologize if this is a duplicate or a bit confusing - I've searched all around SO but can't seem to find how to do what I'm trying to accomplish. I haven't used functions/loops extensively, especially writing from scratch, so I'm not sure if the error is from the function (likely) or from the construct of the data. The basic flow is as follows:
Dummy data set - grouping, type, rate, years, months
I'm running lm formula on the data set by grouping with this bit:
coef_models <- test_coef %>% group_by(Grouping) %>% do(model = lm(rate ~ years + months, data = .))
The result of the above gives me intercepts and coefficients for the variables.
What I'm trying to accomplish next (and failing) is this: for every coefficient estimate that is negative, drop that component from the equation and rerun the lm with just the positive coefficient. For example, for the States grouping, if the years coefficient is negative, I would want to run lm(rate ~ months, data = .) instead.
To get there, with dplyr/broom, I'm taking the results and putting them into a data frame:
#removed lines with negative coefficients
library(dplyr)
library(broom)
coef_output_test <- as.data.frame(coef_models %>% tidy(model))
coef_output_test$Grouping <- as.character(coef_output_test$Grouping)
#drop these coefficients and rerun
coef_output_test_rerun <- coef_output_test[!(coef_output_test$estimate >= 0),]
From here, I'm trying to rerun the groupings with issues without the negative variable from the initial run. Because the variables will vary, some instances will be years dropping out, some will be months, I need to pass through the correct column to use. I think this is where I'm getting hung up:
lm_test_rerun_out <- data.frame(grouping=character()
, '(intercept)'=double()
, term=character()
, estimate=double()
, stringsAsFactors=FALSE)
lm_test_rerun <- function(r) {
y = coef_output_test_rerun$Grouping
x = coef_output_test_rerun$term
for (i in 2:nrow(coef_output_test_rerun)){
lm_test_rerun_out <- test_coef %>% group_by(Grouping["y"]) %>% do(model = lm(rate ~ x, data = .))
}
}
lm_test_rerun(coef_output_test_rerun)
I get this error:
variable lengths differ (found for 'x')
The output for function should be something like this dummy output:
Grouping, Term, (intercept), Estimate
Sports, Years, 0.56, 0.0430
States, Months, 0.67, 0.340
I'm surely not fluent in R, and I'm sure the parts above that do work could be done more efficiently, but the output of the function should be the grouping and x variable used, along with the intercept and estimate for each. Ultimately I'll be taking that output and appending back to the original 'coef_models' - but I can't get past this part for now.
EDIT: sample test_coef set
Grouping Drilldown Years Months Rate
Sports Basketball 10 23 0.42
Sports Soccer 13 18 0.75
Sports Football 9 5 0.83
Sports Golf 13 17 0.59
States CA 13 20 0.85
States TX 14 9 0.43
States AK 14 10 0.63
States AR 10 5 0.60
States ID 18 2 0.22
Countries US 8 19 0.89
Countries CA 9 19 0.86
Countries UK 2 15 0.64
Countries MX 21 15 0.19
Countries AR 8 11 0.62
Consider a base R solution with by that slices dataframe by one or more factors for any extended method to run on each grouped subset. Specifically, below will conditionally re-run lm model by checking coefficient matrix and ultimately returns a dataframe with needed values:
Data
txt <- ' Grouping Drilldown Years Months Rate
Sports Basketball 10 23 0.42
Sports Soccer 13 18 0.75
Sports Football 9 5 0.83
Sports Golf 13 17 0.59
States CA 13 20 0.85
States TX 14 9 0.43
States AK 14 10 0.63
States AR 10 5 0.60
States ID 18 2 0.22
Countries US 8 19 0.89
Countries CA 9 19 0.86
Countries UK 2 15 0.64
Countries MX 21 15 0.19
Countries AR 8 11 0.62'
test_coef <- read.table(text=txt, header=TRUE)
Code
df_list <- by(test_coef, test_coef$Grouping, function(df){
# FIRST MODEL
res <- summary(lm(Rate ~ Years + Months, data = df))$coefficients
# CONDITIONALLY DEFINE FORMULA
f <- NULL
if (res["Years",1] < 0 && res["Months",1] > 0) f <- Rate ~ Months
if (res["Years",1] > 0 && res["Months",1] < 0) f <- Rate ~ Years
# CONDITIONALLY RERUN MODEL
if (!is.null(f)) res <- summary(lm(f, data = df))$coefficients
# ITERATE THROUGH LENGTH OF res MATRIX SKIPPING FIRST ROW
tmp_list <- lapply(seq(length(res[-1,1])), function(i)
data.frame(Group = as.character(df$Grouping[[1]]),
Term = row.names(res)[i+1],
Intercept = res[1,1],
Estimate = res[i+1,1])
)
# RETURN DATAFRAME OF 1 OR MORE ROWS
return(do.call(rbind, tmp_list))
})
final_df <- do.call(rbind, unname(df_list))
final_df
# Group Term Intercept Estimate
# 1 Countries Months -0.0512500 0.04375000
# 2 Sports Years 0.6894118 -0.00372549
# 3 States Months 0.2754176 0.02941113
Do note: removing the negative coefficient from the first model and re-running a new model can render the other component negative when it was previously positive, as the Sports row above shows.
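On the original error: inside lm(rate ~ x, data = .), x is not found in the data, so R falls back to the character vector x defined in the function, whose length differs from rate. A possible way to pass a column name held in a string is reformulate() from the stats package. The sketch below uses the Sports and States rows from the sample data; the mapping keep is hypothetical and simply mirrors the terms in the output above.

```r
test_coef <- data.frame(
  Grouping = rep(c("Sports", "States"), times = c(4, 5)),
  Years    = c(10, 13, 9, 13, 13, 14, 14, 10, 18),
  Months   = c(23, 18, 5, 17, 20, 9, 10, 5, 2),
  Rate     = c(0.42, 0.75, 0.83, 0.59, 0.85, 0.43, 0.63, 0.60, 0.22)
)

# Hypothetical mapping: which single term to keep for each grouping
keep <- c(Sports = "Years", States = "Months")

fits <- lapply(names(keep), function(g) {
  # Build Rate ~ <term> from the string instead of writing rate ~ x
  f   <- reformulate(keep[[g]], response = "Rate")
  fit <- lm(f, data = test_coef[test_coef$Grouping == g, ])
  data.frame(Grouping  = g,
             Term      = keep[[g]],
             Intercept = coef(fit)[[1]],
             Estimate  = coef(fit)[[2]])
})
rerun <- do.call(rbind, fits)
rerun
```

The same reformulate() call could replace the hard-coded formulas in the by() solution if the set of candidate terms grows.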

data frame column total in R

I have data like this (derived using the table() function):
dat <- read.table(text = "responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6", header = TRUE)
dat
responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6
All I want are column totals, i.e. to create a new row at the bottom that says Total, with 35 in column freq and 100 in percent. I am unable to find a solution. colSums doesn't work because the first column is a string.
One option is converting to 'matrix' and using addmargins to get the column sum as a separate row at the bottom. But, this will be a matrix.
m1 <- as.matrix(df1[-1])
rownames(m1) <- df1[,1]
res <- addmargins(m1, 1)
res
# freq percent
#A 9 25.7
#B 13 37.1
#C 10 28.6
#D 3 8.6
#Sum 35 100.0
If you want to convert to data.frame
data.frame(responses=rownames(res), res)
Another option would be getting the sum with colSums for the numeric columns (df1[-1]) (I think here is where the OP got into trouble, ie. applying the colSums on the entire dataset instead of subsetting), create a new data.frame with the responses column and rbind with the original dataset.
rbind(df1, data.frame(responses='Total', as.list(colSums(df1[-1]))))
# responses freq percent
#1 A 9 25.7
#2 B 13 37.1
#3 C 10 28.6
#4 D 3 8.6
#5 Total 35 100.0
data
df1 <- structure(list(responses = c("A", "B", "C", "D"), freq = c(9L,
13L, 10L, 3L), percent = c(25.7, 37.1, 28.6, 8.6)),
.Names = c("responses", "freq", "percent"), class = "data.frame",
row.names = c(NA, -4L))
This might be relevant, using SciencesPo package, see this example:
library(SciencesPo)
tab(mtcars,gear,cyl)
#output
=================================
cyl
--------------------
gear 4 6 8 Total
---------------------------------
3 1 2 12 15
6.7% 13% 80% 100%
4 8 4 0 12
66.7% 33% 0% 100%
5 2 1 2 5
40.0% 20% 40% 100%
---------------------------------
Total 11 7 14 32
34.4% 22% 44% 100%
=================================
Chi-Square Test for Independence
Number of cases in table: 32
Number of factors: 2
Test for independence of all factors:
Chisq = 18.036, df = 4, p-value = 0.001214
Chi-squared approximation may be incorrect
X^2 df P(> X^2)
Likelihood Ratio 23.260 4 0.00011233
Pearson 18.036 4 0.00121407
Phi-Coefficient : NA
Contingency Coeff.: 0.6
Cramer's V : 0.531
@akrun I posted it but you already did the same. Correct me if I'm wrong, but I think we just need this, without creating a new data frame or using as.list.
rbind(df1, c("Total", colSums(df1[-1])))
Output:
responses freq percent
1 A 9 25.7
2 B 13 37.1
3 C 10 28.6
4 D 3 8.6
5 Total 35 100
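One difference between the two rbind variants is worth checking: c("Total", colSums(...)) is a character vector, so binding it as a row coerces the numeric columns to character, whereas the as.list version keeps them numeric. A quick sketch to compare the column classes (the names as_list and as_vec are made up here):

```r
df1 <- data.frame(responses = c("A", "B", "C", "D"),
                  freq      = c(9L, 13L, 10L, 3L),
                  percent   = c(25.7, 37.1, 28.6, 8.6))

# Total row built as a one-row data frame: columns stay numeric
as_list <- rbind(df1, data.frame(responses = "Total",
                                 as.list(colSums(df1[-1]))))

# Total row built with c(): everything in it is a character string
as_vec <- rbind(df1, c("Total", colSums(df1[-1])))

sapply(as_list, class)
sapply(as_vec, class)
```

If the totals are only for display this may not matter, but any later arithmetic on freq or percent needs the numeric version.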
Using sqldf, the classes of the data frame columns are preserved.
library(sqldf)
sqldf("SELECT * FROM df1
UNION
SELECT 'Total', SUM(freq) AS freq, SUM(percent) AS percent FROM df1")
Or, alternatively you can use margin.table and rbind function within R-base. Two lines and voila...
PS: The lines here are longer as I am recreating the data, but you know what I mean :-)
Data
df1 <- matrix(c(9,25.7,13,37.1,10,28.6,3,8.6),ncol=2,byrow=TRUE)
colnames(df1) <- c("freq","percent")
rownames(df1) <- c("A","B","C","D")
Creating Total Calculation
Total <- margin.table(df1,2)
Combining Total Calculation to Original Data
df2 <- rbind(df1,Total)
df2
Inelegant but it gets the job done, please provide reproducible data frames so we don't have to build them first:
data = data.frame(letters[1:4], c(9,13,10,3), c(25.7,37.1, 28.6, 8.6))
colnames(data) = c("X","Y","Z")
data = rbind(data[,1:3], matrix(c("Sum",lapply(data[,2:3], sum)), nrow = 1)[,1:3])
library(janitor)
dat %>%
adorn_totals("row")
responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6
Total 35 100.0
