I have seen many tutorials on making the first row the column names, but nothing explaining how to do the reverse. I would like to have my column names as the first row values and change the column names to something like var1, var2, var3. Can this be done?
I plan to row bind a bunch of data frames later, so they all need the same column names.
q = structure(list(shootings_per100k = c(8105.47466098618, 6925.42653239307
), lawtotal = c(3.00137104283906, 0.903522788541896), felony = c(0.787418655097614,
0.409578330883717)), row.names = c("mean", "sd"), class = "data.frame")
Have:
shootings_per100k lawtotal felony
mean 8105.475 3.0013710 0.7874187
sd 6925.427 0.9035228 0.4095783
Want:
var1 var2 var3
var shootings_per100k lawtotal felony
mean 8105.475 3.0013710 0.7874187
sd 6925.427 0.9035228 0.4095783
edit: I just realized that since I plan to row bind several data frames later, it may be best for them to all have the same column names. I changed the 'want' section to reflect the desired outcome.
q <- rbind(colnames(q), round(q, 4))
colnames(q) <- paste0("var", seq_len(ncol(q)))
rownames(q)[1] <- "var"
q
var1 var2 var3
var shootings_per100k lawtotal felony
mean 8105.4747 3.0014 0.7874
sd 6925.4265 0.9035 0.4096
You can use the names() function to extract the column names of the data frame and row bind them to your data frame. Then use names() again to overwrite the existing names with any standard values you want.
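A minimal base R sketch of that idea, wrapped in a helper so it can be applied to several data frames before row binding them. The helper name names_to_first_row and the second data frame q2 are made up for illustration; the values in q are the rounded ones printed in the question.

```r
# Example data from the question (rounded values as printed there)
q <- data.frame(
  shootings_per100k = c(8105.475, 6925.427),
  lawtotal = c(3.0013710, 0.9035228),
  felony = c(0.7874187, 0.4095783),
  row.names = c("mean", "sd")
)

# Hypothetical helper: move the column names into the first row
names_to_first_row <- function(d) {
  out <- rbind(names(d), d)                    # all columns become character
  names(out) <- paste0("var", seq_along(out))  # var1, var2, ...
  rownames(out)[1] <- "var"
  out
}

res <- names_to_first_row(q)

# A second, made-up data frame with different column names
q2 <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6),
                 row.names = c("mean", "sd"))

# Both results now share var1..var3, so they can be row bound
stacked <- rbind(res, names_to_first_row(q2))
```

Because the names are stored as text, every column in the result is character, so numbers need as.numeric() again before any calculation.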
You can also do as follows.
library(tidyverse)
names <- enframe(colnames(df)) %>%
pivot_wider(-name, names_from = value) %>%
rename_with( ~ LETTERS[1:length(df)])
data <- as_tibble(df) %>%
mutate(across(everything(), ~ as.character(.))) %>%
rename_with(~ LETTERS[1:length(df)])
bind_rows(names, data)
# A tibble: 3 × 3
# A B C
# <chr> <chr> <chr>
# 1 shootings_per100k lawtotal felony
# 2 8105.475 3.001371 0.7874187
# 3 6925.427 0.9035228 0.4095783
I have a large dataset with the two first columns that serve as ID (one is an ID and the other one is a year variable). I would like to compute a count by group and to loop over each variable that is not an ID one. This code below shows what I want to achieve for one variable:
library(tidyverse)
df <- tibble(
ID1 = c(rep("a", 10), rep("b", 10)),
year = c(2001:2020),
var1 = rnorm(20),
var2 = rnorm(20))
df %>%
select(ID1, year, var1) %>%
filter(if_any(starts_with("var"), ~!is.na(.))) %>%
group_by(year) %>%
count() %>%
print(n = Inf)
I cannot use a loop that starts with for(i in names(df)) since I want to keep the variables "ID1" and "year". How can I run this piece of code for all the columns that start with "var"? I tried using quosures, but that did not work: I receive the error select() doesn't handle lists. I also tried select(starts_with("var")), but with no success.
Many thanks!
Another possible solution:
library(tidyverse)
df %>%
group_by(ID1) %>%
summarise(across(starts_with("var"), ~ length(na.omit(.x))))
#> # A tibble: 2 × 3
#> ID1 var1 var2
#> <chr> <int> <int>
#> 1 a 10 10
#> 2 b 10 10
for(i in names(df)[grepl('var',names(df))])
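The loop header above restricts the iteration to the columns whose names contain var, so ID1 and year are left alone. A minimal base R sketch of a complete loop, assuming the goal is the count of non-missing values per year for each of those columns. The data frame mirrors the one in the question, with two NAs added for illustration, and the result list counts is a made-up name.

```r
set.seed(1)
df <- data.frame(
  ID1 = rep(c("a", "b"), each = 10),
  year = 2001:2020,
  var1 = rnorm(20),
  var2 = rnorm(20)
)
df$var1[c(3, 7)] <- NA   # inject missing values for illustration

# Only the columns whose names start with "var"
var_cols <- names(df)[grepl("^var", names(df))]

counts <- list()
for (i in var_cols) {
  keep <- !is.na(df[[i]])                       # rows with a value in column i
  counts[[i]] <- as.data.frame(
    table(year = df$year[keep]),                # count rows per year
    responseName = "n"
  )
}

counts$var1   # one row per year that has a non-missing var1
```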
I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset, so I am reproducing a sample of fake data.
Here's what the data looks like (I have 800 columns).
library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset
What I want to do is sum the columns in buckets of 100 columns: for example, all the values in the first row between column 1 and column 100, all the values in the first row between column 101 and column 200, all the values in the second row between column 1 and column 100, etc.
Using the sample data I've come up with this solution using rowSums.
dataset %>%
mutate_if(~!is.numeric(.x), as.numeric) %>%
mutate_all(funs(replace_na(., 0))) %>%
mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))
but I am getting the following error:
Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected
as the data does not include column a7.
The original data is missing a bunch of columns between a1 and a800 so solving this would be key to make it work.
What would be the best way to approach and solve this error?
Also, I have a few more questions regarding the code I've written:
Is there a smarter way to select the columns a1 to a100 instead of using this approach .[,paste("a", 1:3, sep="")]? I am interested in selecting the columns by name. I do not want to select them by position, because a100 is not always column 100.
Also, I am converting the NAs and the NaNs to 0 in order to be able to sum the rows. I am doing it with mutate_all(funs(replace_na(., 0))), which loses my first column, the one that contains the names. What would be the best way to replace NA and NaN without turning the string values of that column into 0?
The columns I am adding are integers, as I converted them beforehand with mutate_if(~!is.numeric(.x), as.numeric). Should I follow the same approach if I have dbl columns?
Thank you!
Here is one way to do this: after transforming the data to long format, we create groups of n rows for each name and take the sum of each group.
library(dplyr)
library(tidyr)
n <- 2 #No of columns to bucket. Change this to 100 for your case.
dataset %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name) %>%
group_by(grp = rep(seq_len(n()), each = n, length.out = n()), .add = TRUE) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
#If needed in wider format again
pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')
# name col1 col2 col3 col4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 2 2 2 1
#2 B 4 4 4 2
#3 C 3 6 3 3
#4 D 9 8 9 4
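The approach above buckets by position among the columns that exist. If the buckets should instead follow the number in the column name (so that a missing a7 simply leaves its bucket smaller), a base R sketch could look like this. It assumes every value column is named a followed by a number, and the bucket width of 3 is a stand-in for 100 in the real data. Passing na.rm = TRUE to rowSums also makes the replace_na step unnecessary, so the name column is never touched.

```r
dataset <- data.frame(
  name = c("A", "B", "C", "D"),
  a1 = 1:4, a2 = c(1, 2, NaN, 5), a3 = 1:4, a4 = 1:4,
  a5 = c(1, 2, NA, 5), a6 = 1:4, a8 = 1:4
)

width <- 3   # assumed bucket width; use 100 for the real data

# Select value columns by name and read the number after the "a"
value_cols <- grep("^a[0-9]+$", names(dataset), value = TRUE)
idx <- as.integer(sub("^a", "", value_cols))

# a1-a3 -> bucket 1, a4-a6 -> bucket 2, a7-a9 -> bucket 3, ...
bucket <- (idx - 1) %/% width + 1

# Sum each bucket row-wise, ignoring NA and NaN
sums <- sapply(split(value_cols, bucket), function(cols) {
  rowSums(dataset[cols], na.rm = TRUE)
})
colnames(sums) <- paste0("sum", colnames(sums))

result <- cbind(dataset["name"], sums)
result
```

Only columns that are actually present are ever selected, so the "undefined columns selected" error cannot occur.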
I'm trying to calculate the mean of some grouped data, but I'm running into an issue where base::mean() returns a different value than base::rowMeans() or the mean I get when I replicate the calculation in Excel.
Here's the code with a simplified data frame looking at just a small piece of the data:
df <- data.frame("ID" = 1101372,
"Q1" = 5.996667,
"Q2" = 6.005556,
"Q3" = 5.763333)
avg1 <- df %>%
summarise(new_avg = mean(Q1,
Q2,
Q3)) # Returns a value of 5.996667
avg2 <- rowMeans(df[,2:4]) # Returns a value of 5.921852
The value in avg2 is what I get when I use AVERAGE in Excel, but I can't figure out why mean() is not generating the same number.
Any thoughts?
Here, mean takes only the first argument, i.e. Q1, as 'x', because the usage shown in ?mean is
mean(x, trim = 0, na.rm = FALSE, ...)
i.e. the second and third arguments are not additional data. In the OP's code, x is taken as Q1, trim as Q2 and na.rm as Q3. Because trim is then 0.5 or more, mean returns the median of x, which is just Q1. The ... at the end also means that the user can supply any number of extra arguments without an error, which leads to confusion like this (if we don't check the usage).
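The argument matching can be checked outside of any pipeline. This small sketch uses the three values from the question; the variable names q1, q2, q3, wrong and right are just for illustration.

```r
q1 <- 5.996667
q2 <- 6.005556
q3 <- 5.763333

# q2 is matched to `trim` and q3 to `na.rm`, so only q1 is averaged
wrong <- mean(q1, q2, q3)

# Combining the values into one vector makes all three part of `x`
right <- mean(c(q1, q2, q3))

wrong
right
```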
We can specify the data as ., subset the columns of interest and use that in rowMeans
df %>%
summarise(new_avg = rowMeans(.[-1]))
This would be more efficient. But if we want to use mean itself, add rowwise() first
df %>%
rowwise() %>%
summarise(new_avg = mean(c(Q1, Q2, Q3)))
# A tibble: 1 x 1
# new_avg
# <dbl>
#1 5.92
Or convert to 'long' format and then do the group_by 'ID' and get the mean
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -ID) %>%
group_by(ID) %>% # can skip this step if there is only a single row
summarise(new_avg = mean(value))
# A tibble: 1 x 2
# ID new_avg
# <dbl> <dbl>
#1 1101372 5.92
In R, I'm trying to aggregate a dataframe based on unique IDs, BUT I need to use some kind of wild card value for the IDs. Meaning I have paired names like this:
lion_tiger
elephant_lion
tiger_lion
And I need the lion_tiger and tiger_lion IDs to be summed together, because the order in the pair does not matter.
Using this dataframe as an example:
df <- data.frame(pair = c("1_3","2_4","2_2","1_2","2_1","4_2","3_1","4_3","3_2"),
value = c("12","10","19","2","34","29","13","3","14"))
So the values for pair IDs, "1_2" and "2_1" need to be summed in a new table. That new row would then read:
1_2 36
Any suggestions? While my example has numbers as the pair IDs, in reality I would need this to work with text (like the "lion_tiger" example above).
We can split the 'pair' column by _, then sort and paste it back, use it in a group by function to get the sum
tapply(as.numeric(as.character(df$value)),
sapply(strsplit(as.character(df$pair), '_'), function(x)
paste(sort(as.numeric(x)), collapse="_")), FUN = sum)
Or another option is gsubfn
library(gsubfn)
df$pair <- gsubfn('([0-9]+)_([0-9]+)', ~paste(sort(as.numeric(c(x, y))), collapse='_'),
as.character(df$pair))
df$value <- as.numeric(as.character(df$value))
aggregate(value~pair, df, sum)
Using tidyverse and purrrlyr
df <- data.frame(name=c("lion_tiger","elephant_lion",
"tiger_lion"),value=c(1,2,3),stringsAsFactors=FALSE)
require(tidyverse)
require(purrrlyr)
df %>% separate(col = name, sep = "_", c("A", "B")) %>%
by_row(.collate = "rows",
..f = function(this_row) {
paste0(sort(c(this_row$A, this_row$B)), collapse = "_")
}) %>%
rename(sorted = ".out") %>%
group_by(sorted) %>%
summarize(sum(value))%>%show
## A tibble: 2 x 2
# sorted `sum(value)`
# <chr> <dbl>
#1 elephant_lion 2
#2 lion_tiger 4
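For text IDs, the same split, sort and paste idea also works in base R if the as.numeric() conversion is left out. A minimal sketch on the lion/tiger example; the names canonical and totals are made up for illustration.

```r
df <- data.frame(
  pair = c("lion_tiger", "elephant_lion", "tiger_lion"),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Sort the two halves of each ID so both orders map to the same key
canonical <- vapply(
  strsplit(df$pair, "_", fixed = TRUE),
  function(x) paste(sort(x), collapse = "_"),
  character(1)
)

# Sum the values per canonical key
totals <- aggregate(value ~ pair,
                    data = transform(df, pair = canonical),
                    FUN = sum)
totals
```

This assumes the names themselves contain no underscore, since the underscore is used as the separator.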
I have a data.frame containing survey data on three binary variables. The data is already in a contingency table, with the first 3 columns being answers (1 = yes, 0 = no) and the fourth column showing the total number of answers. The rows are four different groups.
My aim is to calculate z-scores to check whether the proportions are significantly different from the total.
This is my data:
library(dplyr) #loading libraries
df <- structure(list(var1 = c(416, 1300, 479, 417),
var2 = c(265, 925,473, 279),
var3 = c(340, 1013, 344, 284),
totalN = c(1366, 4311,1904, 1233)),
class = "data.frame",
row.names = c(NA, -4L),
.Names = c("var1","var2", "var3", "totalN"))
and these are my total values
dfTotal <- df %>% summarise_all(funs(sum(., na.rm=TRUE)))
dfTotal
dfTotal <- data.frame(dfTotal)
rownames(dfTotal) <- "Total"
to calculate zScore I use the following formula:
zScore <- function (cntA, totA, cntB, totB) {
#calculate
avgProportion <- (cntA + cntB) / (totA + totB)
probA <- cntA/totA
probB <- cntB/totB
SE <- sqrt(avgProportion * (1-avgProportion)*(1/totA + 1/totB))
zScore <- (probA-probB) / SE
return (zScore)
}
is there a way using dplyr to calculate a 4x3 matrix that holds for all four groups and variables var1 to var3 the z-test-value against the total proportion?
I am currently stuck with this bit of code:
df %>% mutate_all(funs(zScore(., totalN, dfTotal$var1, dfTotal$totalN)))
So the parameters currently used here as dfTotal$var1 and dfTotal$totalN don't work, but I have no idea how to feed them into the formula. The third parameter should not always be var1; it should be var2 or var3 (or totalN) to match the column passed as the first parameter.
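If by z-score you mean standardizing each column, i.e. (x - mean) / sd, R handles that with scale(). Note that this is a different quantity from the two-proportion statistic computed by the zScore function in the question: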
scale(df)
var1 var2 var3 totalN
[1,] -0.5481814 -0.71592544 -0.4483732 -0.5837722
[2,] 1.4965122 1.42698064 1.4952995 1.4690147
[3,] -0.4024623 -0.04058534 -0.4368209 -0.2087639
[4,] -0.5458684 -0.67046986 -0.6101053 -0.6764787
If you want only the three var columns:
scale(df[,1:3])
var1 var2 var3
[1,] -0.5481814 -0.71592544 -0.4483732
[2,] 1.4965122 1.42698064 1.4952995
[3,] -0.4024623 -0.04058534 -0.4368209
[4,] -0.5458684 -0.67046986 -0.6101053
If you want to use your zScore function inside a dplyr pipeline, we'll need to tidy your data first and add new variables containing the values you now have in dfTotal:
library(dplyr)
library(tidyr)
# add grouping variables we'll need further down
df %>% mutate(group = 1:4) %>%
# reshape data to long format
gather(question,count,-group,-totalN) %>%
# add totals by question to df
group_by(question) %>%
mutate(answers = sum(totalN),
yes = sum(count)) %>%
# calculate z-scores by group against total
group_by(group,question) %>%
summarise(z_score = zScore(count, totalN, yes, answers)) %>%
# spread to wide format
spread(question, z_score)
## A tibble: 4 x 4
# group var1 var2 var3
#* <int> <dbl> <dbl> <dbl>
#1 1 0.6162943 -2.1978303 1.979278
#2 2 0.6125615 -0.7505797 1.311001
#3 3 -3.9106430 2.6607258 -4.232391
#4 4 2.9995381 0.4712734 0.438899
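The same 4x3 matrix can also be computed without reshaping. This is a base R sketch, assuming the zScore function and the data from the question; the names totals and z are made up for illustration.

```r
df <- data.frame(
  var1 = c(416, 1300, 479, 417),
  var2 = c(265, 925, 473, 279),
  var3 = c(340, 1013, 344, 284),
  totalN = c(1366, 4311, 1904, 1233)
)

# zScore as defined in the question
zScore <- function(cntA, totA, cntB, totB) {
  avgProportion <- (cntA + cntB) / (totA + totB)
  probA <- cntA / totA
  probB <- cntB / totB
  SE <- sqrt(avgProportion * (1 - avgProportion) * (1 / totA + 1 / totB))
  (probA - probB) / SE
}

totals <- colSums(df)   # column totals over all groups

# One column of z-scores per variable, one row per group
z <- sapply(c("var1", "var2", "var3"), function(v) {
  zScore(df[[v]], df$totalN, totals[[v]], totals[["totalN"]])
})
z
```

Each column of z compares the group proportions of one variable against the overall proportion of that same variable, which is the matching the question asks for.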