I've got a large dataframe with 72 rows and 72 columns. All are numbers. I just want to make a table with the sum of the first row multiplied by the sum of the first column, the sum of the second row multiplied by the sum of the second column, and so on. So for example, if this is my df
x0 <- c(0,0,11,0)
x0.1 <- c(0,251,0,0)
x0.2 <- c(0,495,0,0)
x0.4 <- c(0,0,0,6)
df <- data.frame(x0,x0.1,x0.2,x0.4)
I want my table to look something like this
1 0
2 124911
3 5445
4 36
Would you be looking for:
rowSums(df)*colSums(df)
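As a quick check, the one-liner above can be run on the example data from the question. This is only a minimal sketch in base R; the names row_totals, col_totals and result are introduced here for illustration and are not part of the original post. The two vectors are paired by position, so this relies on the data frame being square.

```r
# Example data from the question
x0   <- c(0, 0, 11, 0)
x0.1 <- c(0, 251, 0, 0)
x0.2 <- c(0, 495, 0, 0)
x0.4 <- c(0, 0, 0, 6)
df <- data.frame(x0, x0.1, x0.2, x0.4)

row_totals <- rowSums(df)  # one total per row
col_totals <- colSums(df)  # one total per column

# Element i is (sum of row i) * (sum of column i); drop the names for a plain vector
result <- unname(row_totals * col_totals)
result
```

On this data the second row sums to 746 and the second column to 251, so the second element comes out as 187246 rather than the 124911 shown in the question; the other three elements (0, 5445, 36) match.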
Solution using sapply:
sapply(seq_len(nrow(df)), function(i) sum(df[i, ]) * sum(df[, i]))
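Here is a tidyverse approach. Note that it pastes the values of each row together and strips leading zeros, rather than multiplying row sums by column sums: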
library(dplyr)
library(tidyr)
df %>%
unite(x, c(x0, x0.1, x0.2, x0.4), sep = "") %>%
mutate(x = sub("^0+(?!$)", "", x, perl=TRUE))
x
1 0
2 2514950
3 11000
4 6
I have a data frame like this
df <- data.frame(Income = c("$100to$200","under$100","above$1000"))
I would like this as output
df_final <- data.frame(Avg = c(150,100,1000))
I would like to extract the numeric value from the income column, if there are two numbers, take the average, if there is only one number, take that number.
A few key steps here. First we need to clean our data; in this case, getting rid of the $ makes things easier. Then we'll split into a From and a To column. Finally we need to convert to numeric and calculate the row means.
library(tidyverse)
df %>%
mutate(Income = gsub("$", "", Income, fixed = TRUE)) %>%
separate(Income, into = c("From", "To"), sep = "to|under|above") %>%
mutate_all(.,as.numeric) %>%
mutate(Avg = rowMeans(.,na.rm =TRUE))
From To Avg
1 100 200 150
2 NA 100 100
3 NA 1000 1000
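The same idea (pull out the numbers, then average whatever was found) can also be sketched in base R without any packages. This is an illustration rather than the answerer's code; digits and avg are names introduced here, and it assumes every number in the column is a plain run of digits.

```r
df <- data.frame(Income = c("$100to$200", "under$100", "above$1000"),
                 stringsAsFactors = FALSE)

# Pull every run of digits out of each string (a list of character vectors)
digits <- regmatches(df$Income, gregexpr("[0-9]+", df$Income))

# Average the numbers found in each string; a single number is its own mean
avg <- vapply(digits, function(x) mean(as.numeric(x)), numeric(1))

df_final <- data.frame(Avg = avg)
df_final
```

Because the mean of a single value is that value, the "one number" and "two numbers" cases need no separate branches.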
You could try:
library(dplyr)
library(stringr)
df %>%
mutate(across(Income, ~ sapply(str_extract_all(.x, '\\d+'), \(x) mean(as.numeric(x)))))
Income
1 150
2 100
3 1000
A stringr approach, using gsub (base R) to keep only the numbers, str_squish to remove the extra white space, and str_split to separate the entries when there is more than one value.
library(stringr)
data.frame(Avg = sapply(
str_split(str_squish(gsub("[[:alpha:]$]", " ", df$Income)), " "), function(x)
sum(as.numeric(x)) / length(x)))
Avg
1 150
2 100
3 1000
library(dplyr)
df %>%
transmute(
Avg = stringr::str_extract_all(Income, "(?<=\\$)\\d+") %>%
lapply(as.numeric) %>%
sapply(mean)
)
Avg
1 150
2 100
3 1000
I have a dataframe of values for 50 IDs repeated over 10 iterations. I would like to subset by ID and then perform calculations, and repeat that for each column from x1 to x5. I used a for-loop but it is very inefficient (my actual dataset has a lot more IDs).
Here are the calculations I would like to perform. I've had varying success with the conversion to dplyr:
First calculation, gives me the correct value for x1, but need to repeat for each column from x1 to x5.
V1.x1 <- preds.df %>%
split(.$ID) %>%
sapply(function(ID) {
(ID$x1 - mean(ID$x1))^2 # for X1 only
}) %>%
mean()
A different calculation that involves subtracting from a corresponding value in another df data.pop. My dplyr attempt is wrong even for just x1:
## This is what I want to achieve, which I implemented using for-loop:
# df for for-loop
Bsq.perID <- data.frame(matrix(NA,
nrow = nrow(data.pop), # 50 observations
ncol = 5)) # 5 models
# For-loop:
for (ids in 1:nrow(data.pop)){
current.ID <- preds.df[preds.df$ID == ids, ] # get current ID over all 10 iterations
for (i in 1:5){
Bsq.perID[ids, i] <- (mean(current.ID[, i]) - data.pop[ids, "real.val"])^2
}
}
Bsq.values <- colMeans(Bsq.perID)
## My wrong dplyr attempt of the above:
B1.x1 <- preds.df %>%
split(.$ID) %>%
sapply(function(ID) {
(mean(ID$x1) - data.pop$real.val)^2
}) %>%
mean()
The structure of preds.df looks like this:
head(preds.df)
x1 x2 x3 x4 x5 iteration ID
1 20.005984 6.78242996 3.526411 21.463892 8.792720 1 1
2 2.890490 7.28232755 18.670470 6.717213 19.830930 1 2
3 4.868658 24.88117301 1.883913 3.897779 14.371414 1 3
4 6.495532 5.79591685 7.745554 20.153269 7.935672 1 4
5 19.297779 0.05068784 21.744816 14.957751 14.232126 1 5
6 7.090456 22.06322779 8.388263 10.672151 9.921884 1 6
tail(preds.df)
x1 x2 x3 x4 x5 iteration ID
495 16.306927 2.8873609 9.7764755 23.798867 10.246443 10 45
496 4.767296 23.2086303 8.8394391 7.806442 24.898483 10 46
497 19.966301 13.7151699 10.2483011 15.199162 9.658736 10 47
498 18.134534 22.1658901 5.6481757 18.501411 23.787457 10 48
499 7.877636 7.2356274 8.2862336 3.790823 11.610848 10 49
500 8.554774 0.9199501 0.9650191 17.155611 1.158619 10 50
I would approach it like this:
library(dplyr)
library(rio)
preds.df <- import("~/Downloads/preds.df.csv")
data.pop <- import("~/Downloads/data.pop.csv")
## added a row because data.pop is only 49 rows in the data you sent
data.pop <- bind_rows(data.pop, data.pop[1,])
You could use dplyr with mutate() to do this:
dat1 <- preds.df %>%
group_by(ID) %>%
mutate(across(x1:x5, function(x)(x-mean(x))^2))
Then, for the second part you could merge the data. Join onto the original preds.df rather than dat1, because x1 to x5 in dat1 already hold squared deviations.
data.pop <- data.pop %>%
mutate(ID = 1:n())
dat2 <- preds.df %>% left_join(data.pop)
Next, summarise on ID to calculate the mean of x1 to x5 within ID, subtract real.val from each mean, and square the result. This gives one row per ID, matching Bsq.perID in the for-loop; taking the column means of x1 to x5 then gives Bsq.values.
dat2 <- dat2 %>%
select(c(ID, x1:x5, real.val)) %>%
group_by(ID) %>%
summarise(across(x1:x5, function(x) (mean(x) - first(real.val))^2))
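For comparison, both calculations from the question can be sketched in base R for all five columns at once. This is a hedged illustration on simulated data: the values are made up, n_id, n_iter, cols, id_means and V1.values are names introduced here, and it assumes (as the for-loop does) that row i of data.pop belongs to ID i.

```r
set.seed(1)
n_id <- 50
n_iter <- 10

# Simulated stand-ins for the question's data (values are made up)
preds.df <- data.frame(matrix(runif(n_id * n_iter * 5, 0, 25), ncol = 5))
names(preds.df) <- paste0("x", 1:5)
preds.df$iteration <- rep(seq_len(n_iter), each = n_id)
preds.df$ID <- rep(seq_len(n_id), times = n_iter)
data.pop <- data.frame(real.val = runif(n_id, 0, 25))

cols <- paste0("x", 1:5)

# Per-ID means of x1..x5: one row per ID, rows ordered by ID
id_means <- t(sapply(split(preds.df[cols], preds.df$ID), colMeans))

# Second calculation: (mean per ID - real.val)^2, then averaged over IDs.
# real.val is recycled down each column, so row i uses real.val[i].
Bsq.values <- colMeans((id_means - data.pop$real.val)^2)

# First calculation: squared deviation from the ID mean, averaged over all rows
V1.values <- colMeans(sapply(preds.df[cols],
                             function(x) (x - ave(x, preds.df$ID))^2))

Bsq.values
V1.values
```

Here ave() returns each row's group mean, which avoids splitting the data frame for the first calculation.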
Let's say I make a dummy dataframe with 6 columns with 10 observations:
X <- data.frame(a=1:10, b=11:20, c=21:30, d=31:40, e=41:50, f=51:60)
I need to create a loop that evaluates 3 columns at a time, adding the summed second and third columns and dividing this by the sum of the first column:
(sum(b)+sum(c))/sum(a) ... (sum(e)+sum(f))/sum(d) ...
I then need to construct a final dataframe from these values. For example using the dummy dataframe above, it would look like:
value
1. 7.454545
2. 2.84507
I imagine I need to use the next function to iterate within the loop, but I'm fairly lost! Thank you for any help.
You can split your data frame into groups of 3 by creating a vector with rep where each element repeats 3 times. Then with this list of sub data frames, (s)apply the function of summing the second and third columns, adding them, and dividing by the sum of the first column.
out_vec <-
sapply(
split.default(X, rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (sum(x[2]) + sum(x[3]))/sum(x[1]))
data.frame(value = out_vec)
# value
# 1 7.454545
# 2 2.845070
You could also sum all the columns up front before the sapply with colSums, which will be more efficient.
out_vec <-
sapply(
split(colSums(X), rep(1:ncol(X), each = 3, length.out = ncol(X)))
, function(x) (x[2] + x[3])/x[1])
data.frame(value = out_vec, row.names = NULL)
# value
# 1 7.454545
# 2 2.845070
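To make the grouping step concrete, the sketch below prints the grouping vector and then shows one more variation that reshapes the column sums into a matrix. It is only an illustration; grp, sums and value are names introduced here, and the matrix variation assumes the number of columns is a multiple of 3.

```r
X <- data.frame(a = 1:10, b = 11:20, c = 21:30, d = 31:40, e = 41:50, f = 51:60)

# The grouping vector: one label per column, changing every 3 columns
grp <- rep(1:ncol(X), each = 3, length.out = ncol(X))
grp
# [1] 1 1 1 2 2 2

# Variation: put the column sums in a 3-row matrix, one matrix column per group
sums <- matrix(colSums(X), nrow = 3)

# Row 1 holds each group's first column sum, rows 2 and 3 the other two
value <- (sums[2, ] + sums[3, ]) / sums[1, ]
data.frame(value = value)
```

The matrix is filled column by column, so the sums of a, b, c land in the first matrix column and the sums of d, e, f in the second.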
You could use tapply:
tapply(colSums(X), gl(ncol(X)/3, 3), function(x)sum(x[-1])/x[1])
1 2
7.454545 2.845070
Here is an option with tidyverse
library(dplyr) # 1.0.0
library(tidyr)
X %>%
summarise(across(everything(), sum)) %>%
pivot_longer(everything()) %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
summarise(value = sum(lead(value)/first(value), na.rm = TRUE)) %>%
select(value)
# A tibble: 2 x 1
# value
# <dbl>
#1 7.45
#2 2.85
I have a data.frame df with columns A and B:
df <- data.frame(A = 1:5, B = 11:15)
There's another data.frame, df2, which I'm building by various calculations that ends up having generic column names X1 and X2, which I cannot control directly (because it passes through being a matrix at one point). So it ends up being something like:
mtrx <- matrix(1:10, ncol = 2)
mtrx %>% data.frame()
I would like to rename the columns in df2 to match those of df. I could, of course, do it after I finish building df2 with a simple assignment:
names(df2)<-names(df)
My question is: is there a way to do this directly within the pipe? I can't seem to use dplyr::rename, because its arguments have to be in the form newname = oldname, and I can't seem to vectorize it. The same goes for the data.frame call itself: I can't just give it a vector of column names, as far as I can tell. Is there another option I'm missing? What I'm hoping for is something like
mtrx %>% data.frame() %>% rename(names(df))
but this doesn't work; it fails with Error: All arguments must be named.
Cheers!
You can use setNames
mtrx %>%
data.frame() %>%
setNames(., nm = names(df))
# A B
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Or use purrr's equivalent set_names
mtrx %>%
data.frame() %>%
purrr::set_names(., nm = names(df))
A third option is "names<-"
mtrx %>%
data.frame() %>%
"names<-"(names(df))
We can use rename_all from tidyverse
library(tidyverse)
mtrx %>%
as.data.frame %>%
rename_all(~ names(df))
# A B
# 1 1 6
# 2 2 7
# 3 3 8
# 4 4 9
# 5 5 10
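If no packages are loaded at all, the setNames idea also works with base R's native pipe. This is a minimal sketch assuming R 4.1 or later (where |> is available); df2 is simply the name used for the result.

```r
df <- data.frame(A = 1:5, B = 11:15)
mtrx <- matrix(1:10, ncol = 2)

# The native pipe passes the left-hand side as the first argument,
# so setNames() receives the data frame and the new names
df2 <- mtrx |>
  as.data.frame() |>
  setNames(names(df))

names(df2)
```

setNames() returns its first argument with the names replaced, which is why it slots into a pipe where names(x) <- value cannot.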
This is not an easy problem for me, to be honest. I have searched for quite a long time, but there seems to be no similar question.
Here's what a few rows and columns of my data look like:
                                V1        V2       V3
1 74c1c25f4b283fa74a5514307b0d0278 1#11:2241 1#10:249
2 08f5b445ec6b29deba62e6fd8b0325a6  20#7:249  20#5:83
3 4b7f6f4e2bf237b6cc58f57142bea5c0  4#16:249   24:913
So, the cells are in a format like "class(#subclass):value". I want to make a table like this:
                                V1 1#10 1#11 4#16 20#5 20#7  24
1 74c1c25f4b283fa74a5514307b0d0278  249 2241    0    0    0   0
2 08f5b445ec6b29deba62e6fd8b0325a6    0    0    0   83  249   0
3 4b7f6f4e2bf237b6cc58f57142bea5c0    0    0  249    0    0 913
Because I haven't met this kind of data structure before, I am not sure if this is the best way to store it. But so far, this is the only table format I could come up with. If you have any suggestion about it, please leave a comment.
Then, I first parsed it as the following:
                                V1 V2_1_1 V2_1_2 V2_2_1 V3_1_1 V3_1_2 V3_2_1
1 74c1c25f4b283fa74a5514307b0d0278      1     11   2241      1     10    249
2 08f5b445ec6b29deba62e6fd8b0325a6     20      7    249     20      5     83
3 4b7f6f4e2bf237b6cc58f57142bea5c0      4     16    249     24     NA    913
Now, I don't know how to convert it to the table format I want. Is there an R package I can use to do it?
Two links are attached below:
original data: https://www.dropbox.com/s/aqay5dn4r3m3kdp/temp1TrainPoiFile.R?dl=0
parsed data:
https://www.dropbox.com/s/0oj8ic1pd2rew0h/temp3TrainPoiFile.R?dl=0
Thank you very much for your help. Please leave a comment if you have any questions about it.
Thanks to Walt and Jack for their answers. I used tidyr to solve the problem. Below is how I did it.
Read file
source("temp1TrainPoiFile.R")
gather columns to key-value pair
temp2TrainPoiFile <- temp1TrainPoiFile %>% gather( key=V1, value=data, -V1)
extract to two columns
temp3TrainPoiFile <- temp2TrainPoiFile %>% extract(col=data, into=c("class","value"), regex="(.*):(.*)")
adding row numbers
row <- 1:nrow(temp3TrainPoiFile)
temp3TrainPoiFile <- cbind(row, temp3TrainPoiFile)
spread key-value to two columns
TrainPoiFile <- temp3TrainPoiFile %>% spread(key=class, value=value, fill=0)
This looks like a good example of the use of the tidyr package. Use gather to transform into a two column data frame using column V1 as the key and the other columns as the value column named data, extract to split the data column into class and value columns, and then spread to use the class column as new column names and the value columns as values. Code would look like:
library(tidyr)
library(dplyr)
class_table <- df %>% mutate(row = 1:nrow(.)) %>%
gather( key=V1, value=data, -c(V1,row)) %>%
extract(col=data, into=c("class","value"), regex="(.*):(.*)") %>%
spread(key=class, value=value, fill=0)
Edited to ensure uniqueness of row identifiers. mutate requires dplyr package.
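The same gather, extract, spread sequence can be sketched in base R with strsplit and xtabs. This is a swapped-in technique shown for illustration, not the answerer's code: it uses only the three sample rows from the question, and long, parts and wide are names introduced here.

```r
df <- data.frame(
  V1 = c("74c1c25f4b283fa74a5514307b0d0278",
         "08f5b445ec6b29deba62e6fd8b0325a6",
         "4b7f6f4e2bf237b6cc58f57142bea5c0"),
  V2 = c("1#11:2241", "20#7:249", "4#16:249"),
  V3 = c("1#10:249", "20#5:83", "24:913"),
  stringsAsFactors = FALSE
)

# Long format: one row per (id, cell) pair
long <- data.frame(id = rep(df$V1, times = 2),
                   cell = c(df$V2, df$V3),
                   stringsAsFactors = FALSE)

# Split each "class:value" cell at the colon
parts <- strsplit(long$cell, ":", fixed = TRUE)
long$class <- vapply(parts, `[`, character(1), 1)
long$value <- as.numeric(vapply(parts, `[`, character(1), 2))

# Cross-tabulate: ids in rows, classes in columns, absent pairs become 0
wide <- xtabs(value ~ id + class, data = long)
wide
```

xtabs sums the values for each id and class pair, so it fills missing combinations with 0 on its own; rows and columns come out in sorted order rather than in the original order. Use as.data.frame.matrix(wide) if a plain data frame is needed.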
Read in data
data <- source("temp1TrainPoiFile.R")[[1]]
Proper NAs
data[data == ""] <- NA
Reshape it into long format
data <- do.call(rbind, lapply(split(data, data[,"V1"]), function(n) {
id <- n[,1]
n <- na.omit(unlist(n[,-1]))
n <- strsplit(n, ":")
n <- do.call(rbind, lapply(n, function(m) data.frame(column = m[1], value = m[2])))
n <- data.frame(id = id, n)
n}))
Prepare for loop to insert the values into a newly created matrix
id <- unique(data[,"id"])
column <- unique(data[,"column"])
mat <- matrix(data = NA, nrow = length(id), ncol = length(column))
rownames(mat) <- id
colnames(mat) <- column
Insert the values
for(i in 1:nrow(data)) {
mat[data[i, "id"], data[i, "column"]] <- data[i,"value"]}