Create Variable from Each Column in Data Frame - r

I have one data frame that contains columns that I want to look at individually. I am not sure what the common method is for analyzing data individually like this, but I want to create a separate variable/data frame for each column in my original data frame. I know I can subset, but is there a way I can use a for loop (is this the easiest way?) in order to create x new variables from the x columns in my data frame?
For more details on my data frame, I have a product and a corresponding index (which the product is being judged against).
Example data frame:
Date Product 1 Index 1 Product 2 Index 2
1/1/1995 2.89 2.75 4.91 5.01
2/1/1995 1.38 1.65 3.47 3.29
So I would like to create a variable for each product and its corresponding index, without manually creating a data frame for each one or subsetting every time I want to analyze a product.

Like someone mentioned in the comments, you can do this by indexing. But if you really want separate vectors for each column in your data frame, you could do it like this:
df <- data.frame(x = 1:10, y = 11:20, z = 21:30)
# Create one variable per column, named after the column
for (i in colnames(df)) {
  assign(i, df[, i])
}
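For comparison, here is a minimal sketch of the indexing route mentioned in the comments, plus a way to keep generated variables out of the global workspace. The names col_y, cols and env are illustrative only and do not come from the question.

```r
df <- data.frame(x = 1:10, y = 11:20, z = 21:30)

# Indexing: pull one column out as a vector, no extra variables needed
col_y <- df[["y"]]

# as.list() gives a named list with one element per column
cols <- as.list(df)

# If separate variables are really wanted, a dedicated environment
# avoids overwriting objects in the global workspace
env <- list2env(cols, envir = new.env())
z_values <- get("z", envir = env)
```

Keeping the columns in a list or environment means they can still be looped over with lapply, which is harder once they are loose variables.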

You could index the columns and put them into a new list with each element containing the product/index pair and the date column.
ind <- seq(2, by = 2, length.out = ncol(dat[-1])/2)
(sets <- lapply(ind, function(i) dat[c(1, i:(i+1))]))
# [[1]]
# Date Product1 Index1
# 1 1/1/1995 2.89 2.75
# 2 2/1/1995 1.38 1.65
#
# [[2]]
# Date Product2 Index2
# 1 1/1/1995 4.91 5.01
# 2 2/1/1995 3.47 3.29
If you want, you can then assign these data frames to the global environment with list2env
list2env(setNames(sets, paste0("Set", seq_along(sets))), .GlobalEnv)
Set1
# Date Product1 Index1
# 1 1/1/1995 2.89 2.75
# 2 2/1/1995 1.38 1.65
Set2
# Date Product2 Index2
# 1 1/1/1995 4.91 5.01
# 2 2/1/1995 3.47 3.29
Data:
dat <-
structure(list(Date = structure(1:2, .Label = c("1/1/1995", "2/1/1995"
), class = "factor"), Product1 = c(2.89, 1.38), Index1 = c(2.75,
1.65), Product2 = c(4.91, 3.47), Index2 = c(5.01, 3.29)), .Names = c("Date",
"Product1", "Index1", "Product2", "Index2"), class = "data.frame", row.names = c(NA,
-2L))

This is what attach does. You can just do attach(my_data_frame).
Most people who know what they're doing would tell you this lies somewhere between "unnecessary" and "not a good idea".
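A commonly suggested alternative to attach is with(), which makes the columns visible only for the duration of one expression. A small sketch, using the first two rows of the question's example (the name spread is made up for illustration):

```r
my_data_frame <- data.frame(
  Product1 = c(2.89, 1.38),
  Index1   = c(2.75, 1.65)
)

# Columns can be referred to by name inside with(), and nothing
# is left behind on the search path afterwards
spread <- with(my_data_frame, Product1 - Index1)
spread
```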

Related

Looping over multiple values and storing results in a data frame

I am trying to calculate some macronutrients values obtained from a reference table. These are my inputs:
DF1. The values indicate the amount of portions consumed per day, on a monthly basis.
id Cow Milk Soy Milk Yoghurt (...)
001 0.07 0 0 ...
002 0 0.4 0 ...
003 0.07 0.07 0.13 ...
004 2.5 0 0 ...
... ... ... ... ...
My reference table looks like this:
DF2. Reference values
Food Kcal Proteins Trans Fat Satured Fat (5 more columns)
Cow Milk 91.50 4.95 4.95 3.12 ...
Soy Milk 49.50 4.20 2.85 1.80 ...
Yoghurt 122.00 7.00 6.60 0.18 ...
...... ... ... ... ... ...
What I need to do is:
Multiply portions value of the food times the corresponding value of that food in the reference table for each variable (i.e., kcal, protein, fat...).
Sum all the values obtained for each food in the same variable (sum all kcal, sum all the protein...) for that id.
Consolidate in one data.frame.
So, for example, the kcal and protein values only for id 001 so far should be:
id001
kcal
(0.07*91.5) + (0*49.5) + (0*122) = 6.405
protein
(0.07*4.95) + (0*4.2) + (0*7) = 0.3465
...
And I need to calculate that for all the foods, all the other variables of reference table for that same id and for dozens of other ids.
My final table should look like this:
id     Total Kcal   Total Proteins   ...
001    6.405        0.3465           ...
...    ...          ...              ...
I was thinking of implementing a loop:
results <- data.frame()
for (i in 1:ncol(df1)) {
kcal <- df1[,i] * df2[i,]
results$kcal <- rbind(results$kcal, kcal)
}
But I don't even know how to make it iterate through each variable while keeping the df1[,i] position, nor how to make it sum the values once it has finished. I have never done anything this complex before. Any help is appreciated.
Here is a tidyverse option
library(tidyverse)
DF1 %>%
pivot_longer(-id, names_to = "Food", values_to = "portion") %>%
left_join(DF2 %>% pivot_longer(-Food), by = "Food") %>%
group_by(id, name) %>%
summarise(value = sum(value * portion), .groups = "drop") %>%
pivot_wider(names_prefix = "Total ")
## A tibble: 4 × 5
# id `Total Kcal` `Total Proteins` `Total Satured Fat` `Total Trans Fat`
# <int> <dbl> <dbl> <dbl> <dbl>
#1 1 6.40 0.347 0.218 0.347
#2 2 19.8 1.68 0.72 1.14
#3 3 25.7 1.55 0.368 1.40
#4 4 229. 12.4 7.8 12.4
Please note that the example calculation for Total Proteins for id001 should give 0.3465, not 0.198:
(0.07 * 4.95) + (0 * 4.2) + (0 * 7) = 0.3465
Explanation: We reshape both DF1 and DF2 from wide to long, then do a left-join of long DF1 with long DF2 by "Food". We can then group_by(id, name) (where name gives the name of the quantity from DF2: Kcal, Proteins, Trans Fat, etc.) and calculate the desired quantities as the sum(value * portion), respectively. Finally, we reshape again from long to wide, and add the prefix "Total " to the new wide columns.
Sample data
DF1 <- read.table(text = "id 'Cow Milk' 'Soy Milk' Yoghurt
001 0.07 0 0
002 0 0.4 0
003 0.07 0.07 0.13
004 2.5 0 0", header = T, check.names = F)
DF2 <- read.table(text = "Food Kcal Proteins 'Trans Fat' 'Satured Fat'
'Cow Milk' 91.50 4.95 4.95 3.12
'Soy Milk' 49.50 4.20 2.85 1.80
Yoghurt 122.00 7.00 6.60 0.18", header = T, check.names = F)
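The "multiply, then sum per id" step described in the question is also what a matrix product computes, so a base R sketch without reshaping is possible. This is not the method used in the answer above; the names portions, reference and totals are illustrative, and the values are copied from the sample data.

```r
# Portions per id (rows) and food (columns)
portions <- matrix(
  c(0.07, 0,    0,
    0,    0.4,  0,
    0.07, 0.07, 0.13,
    2.5,  0,    0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("001", "002", "003", "004"),
                  c("Cow Milk", "Soy Milk", "Yoghurt"))
)

# Reference values per food (rows) and nutrient (columns)
reference <- matrix(
  c(91.5,  4.95, 4.95, 3.12,
    49.5,  4.20, 2.85, 1.80,
    122.0, 7.00, 6.60, 0.18),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Cow Milk", "Soy Milk", "Yoghurt"),
                  c("Kcal", "Proteins", "Trans Fat", "Satured Fat"))
)

# Align foods by name, then multiply: each cell is sum(portion * reference value)
totals <- portions %*% reference[colnames(portions), ]

result <- data.frame(id = rownames(totals), totals,
                     check.names = FALSE, row.names = NULL)
result
```

Indexing reference by colnames(portions) guards against the two tables listing the foods in a different order.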
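Here is a way to achieve this using a for loop:
library(dplyr)  # for %>% and filter()
results = data.frame()
for (i in 1:nrow(DF1)) {
  # Rows of DF2 for the foods that row i actually consumed
  df_composition_for_id_i = DF2 %>% filter(Food %in% names(DF1[i,])[DF1[i,]>0])
  # Portions of those foods (assumes DF1's columns follow DF2's row order)
  quantity_food = t(DF1[i,-1])[t(DF1[i,-1])>0]
  # Multiply each food's row by its portion
  df_transform = sweep(df_composition_for_id_i[,-1], 1, quantity_food, `*`)
  # Sum per nutrient and prepend the row number as id
  Total = c(i, colSums(df_transform))
  names(Total)[1]= "id"
  results = rbind(results, Total)
}
names(results) = names(DF2)
names(results)[1] = "id"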
> results
id Kcal Proteins Trans Fat Satured Fat
1 1 6.405 0.3465 0.3465 0.2184
2 2 19.800 1.6800 1.1400 0.7200
3 3 25.730 1.5505 1.4040 0.3678
4 4 228.750 12.3750 12.3750 7.8000
With this for loop, you can add more columns to DF2 (e.g. carbohydrates, vitamins, ...) and they will be computed in the loop without further changes.
Explanation:
In the for loop, df_composition_for_id_i is a data frame containing only the foods consumed in the current iteration. For example, when i=3:
i=3
df_composition_for_id_i = DF2 %>% filter(Food %in% names(DF1[i,])[DF1[i,]>0])
df_composition_for_id_i
Food Kcal Proteins Trans Fat Satured Fat
1 Cow Milk 91.5 4.95 4.95 3.12
2 Soy Milk 49.5 4.20 2.85 1.80
3 Yoghurt 122.0 7.00 6.60 0.18
quantity_food holds the portion of each of those foods, which is used to multiply the corresponding row:
quantity_food
[1] 0.07 0.07 0.13
df_transform takes the first object created in the loop (df_composition_for_id_i), excluding the Food column, and multiplies each row by the matching element of quantity_food using the sweep function:
df_transform = sweep(df_composition_for_id_i[,-1], 1, quantity_food, `*`)
df_transform
Kcal Proteins Trans Fat Satured Fat
1 6.405 0.3465 0.3465 0.2184
2 3.465 0.2940 0.1995 0.1260
3 15.860 0.9100 0.8580 0.0234
Lastly, the column sums are calculated, the id is added and named, and the result is bound by row to the results data frame:
Total = c(i, colSums(df_transform))
names(Total)[1]= "id"
results = rbind(results, Total)
id Kcal Proteins Trans Fat Satured Fat
1 3 25.73 1.5505 1.404 0.3678

Using rle function with condition on a column in r

My dataset has 523 rows and 93 columns and it looks like this:
data <- structure(list(`2018-06-21` = c(0.6959635416667, 0.22265625,
0.50341796875, 0.982942708333301, -0.173828125, -1.229259672619
), `2018-06-22` = c(0.6184895833333, 0.16796875, 0.4978841145833,
0.0636718750000007, 0.5338541666667, -1.3009207589286), `2018-06-23` = c(1.6165364583333,
-0.375, 0.570800781250002, 1.603515625, 0.5657552083333, -0.9677734375
), `2018-06-24` = c(1.3776041666667, -0.03125, 0.7815755208333,
1.5376302083333, 0.5188802083333, -0.552966889880999), `2018-06-25` = c(1.7903645833333,
0.03125, 0.724609375, 1.390625, 0.4928385416667, -0.723074776785701
)), row.names = c(NA, 6L), class = "data.frame")
Each row is a city, and each column is a day of the year.
After calculating the row average in this way
data$mn <- apply(data, 1, mean)
I want to create another column, data$duration, that gives the average length of the periods of consecutive days where the values are greater than data$mn.
I tried with this code:
data$duration <- apply(data[-6], 1, function(x) with(rle(x > data$mean), mean(lengths[values])))
But it does not seem to work. In particular, it appears that rle(x > data$mean) fails to recognize the end of a row.
What are your suggestions?
Many thanks
EDIT
Reference dataframe has been changed into a [6x5]
The main challenge you're facing in your code is getting apply (which focuses on one row at a time) to look at the right values of the mean. We can avoid this entirely by keeping the mean out of the data frame, and doing the comparison data > mean to the whole data frame at once. The new columns can be added at the end:
mn = rowMeans(data)
dur = apply(data > mn, 1, function(x) with(rle(x), mean(lengths[values])))
dur
# 1 2 3 4 5 6
# 3.0 1.5 2.0 3.0 4.0 2.0
data = cbind(data, mean = mn, duration = dur)
print(data, digits = 2)
# 2018-06-21 2018-06-22 2018-06-23 2018-06-24 2018-06-25 mean duration
# 1 0.70 0.618 1.62 1.378 1.790 1.2198 3.0
# 2 0.22 0.168 -0.38 -0.031 0.031 0.0031 1.5
# 3 0.50 0.498 0.57 0.782 0.725 0.6157 2.0
# 4 0.98 0.064 1.60 1.538 1.391 1.1157 3.0
# 5 -0.17 0.534 0.57 0.519 0.493 0.3875 4.0
# 6 -1.23 -1.301 -0.97 -0.553 -0.723 -0.9548 2.0
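To see what rle contributes here, it may help to run it on a single logical vector. The sketch below uses the pattern of row 2 above (above the mean on days 1, 2 and 5); the names above, runs and avg_run are illustrative.

```r
# Result of "value > row mean" for one row: TRUE on days 1, 2 and 5
above <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

runs <- rle(above)
runs$lengths   # 2 2 1
runs$values    # TRUE FALSE TRUE

# Keep only the runs of TRUE and average their lengths
avg_run <- mean(runs$lengths[runs$values])
avg_run        # 1.5
```

If a row never exceeds its mean there are no TRUE runs, and the mean of an empty vector is NaN.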

Using a loop to calculate correlation based on subset data in R

I have a large dataset with several products in one column and information on each product including unit retail and quantity by week for the previous several years. I am trying to write a for loop that subsets the data by product name and calculates the correlation between unit retail and quantity for the number of rows for each product.
I have been able to subset the data based on product and calculate the correlation, but there are many products and it would be more beneficial to implement a loop to go through each unique product.
Example of dataset:
`Category Label` `Fiscal Year` `Fiscal Week` `Net Sales` `Extended Quantity` `Unit Retail` `Log QTY` `Log Retail`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 LOOSE CITRUS FY2018 FY2018-P01-W1 170833. 204901. 0.834 12.2 -0.182
2 LOOSE CITRUS FY2018 FY2018-P01-W2 158609. 187650. 0.845 12.1 -0.168
3 LOOSE CITRUS FY2018 FY2018-P01-W3 163580. 196313. 0.833 12.2 -0.182
4 LOOSE CITRUS FY2018 FY2018-P01-W4 146240. 185984. 0.786 12.1 -0.240
5 LOOSE CITRUS FY2018 FY2018-P02-W1 147494. 171036. 0.862 12.0 -0.148
6 LOOSE ONIONS FY2018 FY2018-P01-W1 88802. 78446. 1.13 11.3 0.124
7 LOOSE ONIONS FY2018 FY2018-P01-W2 77365. 66898. 1.16 11.1 0.145
8 LOOSE ONIONS FY2018 FY2018-P01-W3 88026. 75055. 1.17 11.2 0.159
9 LOOSE ONIONS FY2018 FY2018-P01-W4 114720. 97051. 1.18 11.5 0.167
10 LOOSE ONIONS FY2018 FY2018-P02-W1 95746. 82128. 1.17 11.3 0.153
#subset data into own df based on category
allProduce_split <- split(allProduce, allProduce$`Category Label`)
#correlation
cor_produce <- cor(allProduce_split$`LOOSE CITRUS`$`Unit Retail`,
                   allProduce_split$`LOOSE CITRUS`$`Extended Quantity`)
Rather than just returning the correlation for the "LOOSE CITRUS" product in the example, I am hoping to have a table that contains a single row for each product name with the correlation between unit retail and quantity for all 5 fiscal weeks. For example:
'Category Label' 'Cor'
LOOSE CITRUS .5363807
LOOSE ONIONS .6415218
product C .6498723
Product D -.451258
Product E .0012548
Consider by which is similar to split but then allows any function to be applied on the subsets using a third argument. In your case, your function can build a data frame of product label and correlation result:
df_list <- by(allProduce, allProduce$`Category Label`, function(sub)
  # The column name contains a space, so it must be quoted with backticks
  data.frame(product = sub$`Category Label`[1],
             cor_produce = cor(sub$`Unit Retail`,
                               sub$`Extended Quantity`)
  )
)
final_df <- do.call(rbind, unname(df_list))
Alternatively, you can still use the split but then run an lapply:
allProduce_split <- split(allProduce, allProduce$`Category Label`)
df_list <- lapply(allProduce_split, function(sub)
  data.frame(product = sub$`Category Label`[1],
             cor_produce = cor(sub$`Unit Retail`,
                               sub$`Extended Quantity`)
  )
)
final_df <- do.call(rbind, unname(df_list))
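A compact variant of the same split-then-apply idea, shown on hypothetical toy data so the result is easy to check by eye. The column names mirror the question, but the values are made up (one perfectly increasing and one perfectly decreasing relationship), and the Cor column name follows the desired output in the question.

```r
# Hypothetical toy data; values are invented for illustration
allProduce <- data.frame(
  `Category Label`    = rep(c("LOOSE CITRUS", "LOOSE ONIONS"), each = 3),
  `Unit Retail`       = c(1, 2, 3, 1, 2, 3),
  `Extended Quantity` = c(10, 20, 30, 30, 20, 10),
  check.names = FALSE
)

# One correlation per product: split the rows by label, apply cor() to each piece
cors <- sapply(
  split(allProduce, allProduce$`Category Label`),
  function(sub) cor(sub$`Unit Retail`, sub$`Extended Quantity`)
)

# Collect the named vector into a two-column table
final_df <- data.frame(`Category Label` = names(cors), Cor = unname(cors),
                       check.names = FALSE)
final_df
```

Because sapply returns a named numeric vector, no do.call(rbind, ...) step is needed to assemble the table.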
Try:
library(dplyr)
# Backticks are needed for column names with spaces; summarise() returns one row per product
df <- allProduce %>% group_by(`Category Label`) %>% summarise(Cor = cor(`Unit Retail`, `Extended Quantity`))

How I can use the function names with for in R

I have a small problem with my R code. I am making a mistake somewhere, but I don't know where.
The problem is:
I have many Excel files with the same column names. I'd like to replace the column titles with other titles.
These are the five files.
AA <- read_excel("AA.xlsx")
BB <- read_excel("BB.xlsx")
CC <- read_excel("CC.xlsx")
DD <- read_excel("DD.xlsx")
EE <- read_excel("EE.xlsx")
head(AA) #the matrix is the same for the other file.
DATA Open Max Min Close VAR % CLOSE VOLUME
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2004-07-07 00:00:00 3.73 3.79 3.6 3.70 0 21810440
2 2004-07-08 00:00:00 3.7 3.71 3.47 3.65 -1.43 7226890
3 2004-07-09 00:00:00 3.61 3.65 3.56 3.65 0 3754407
4 2004-07-12 00:00:00 3.64 3.65 3.59 3.63 -0.55 850667
5 2004-07-13 00:00:00 3.63 3.63 3.58 3.59 -1.16 777508
6 2004-07-14 00:00:00 3.54 3.59 3.47 3.5 -2.45 1931765
To change the titles quickly, I decided to use this code.
t <- list(AA, BB, CC, DD, EE)
for (i in t ) {
  names(i) <- c("DATA", "OPE", "MAX", "MIN", "CLO", "VAR%", "VOL")
} # R doesn't give any type of error!
head(AA) # the data are unchanged, as if the for loop didn't exist.
Where did I go wrong?
Thank you so much in advance.
Francesco
We can do this with lapply. Get the datasets in a list with mget, loop through the list, set the column names to the vector of names (nm1), and modify the objects in the global environment with list2env:
nm1 <- c("DATA", "OPE", "MAX", "MIN", "CLO", "VAR%", "VOL")
lst <- lapply(mget(nm2), setNames, nm1)
list2env(lst, envir = .GlobalEnv)
Or using a for loop, loop through the string of object names and assign the column names to the objects in the global environment
for(nm in nm2) assign(nm, `names<-`(get(nm), nm1))
Or using tidyverse
library(tidyverse)
mget(nm2) %>%
map(set_names, nm1) %>%
list2env(., envir = .GlobalEnv)
data
AA <- mtcars[1:7]
BB <- mtcars[1:7]
CC <- mtcars[1:7]
DD <- mtcars[1:7]
EE <- mtcars[1:7]
nm2 <- strrep(LETTERS[1:5], 2)
I'll try to explain why your code didn't work. In the list t, the first element (t[[1]]) initially shares its memory address with AA in the global environment. In the for loop, i starts out referring to that same data.frame. When you change the names of i with names(i) <-, the data.frame is copied (twice, as the trace below shows). So you end up renaming a new data.frame i rather than the original data.frame AA in the global environment.
Here is an example to illustrate what I mean (tracemem "marks an object so that a message is printed whenever the internal code copies the object."):
tracemem(mtcars)
# [1] "<0x1095b2150>"
tracemem(iris)
# [1] "<0x10959a350>"
x <- list(mtcars, iris)
for(i in x){
cat('-------\n')
tracemem(i)
names(i) <- paste(names(i), 'xx')
}
# -------
# tracemem[0x1095b2150 -> 0x10d678c00]:
# tracemem[0x10d678c00 -> 0x10d678ca8]:
# -------
# tracemem[0x10959a350 -> 0x10cb307b0]:
# tracemem[0x10cb307b0 -> 0x10cb30818]:
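The same behaviour can be checked without tracemem. In this small sketch (the objects a, b and frames are made up for illustration), renaming the loop variable leaves the list untouched, while assigning through the list index changes the list but still not the original object:

```r
a <- data.frame(x = 1:2, y = 3:4)
b <- data.frame(x = 5:6, y = 7:8)
frames <- list(a = a, b = b)

# Renaming the loop variable only changes a temporary copy
for (f in frames) names(f) <- c("X", "Y")
names_after_loop <- names(frames$a)   # still "x" "y"

# Assigning back through the index modifies the list elements themselves
for (k in seq_along(frames)) names(frames[[k]]) <- c("X", "Y")
names(frames$a)                       # "X" "Y"

# The original object a is a separate value and keeps its names
names(a)                              # "x" "y"
```

This is why the answers above write the result back explicitly, either with assign or with list2env.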

R: How to aggregate with NA values

To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
Year = year,
Category = categ,
Population = pop,
Money = money)
Now the data I'm actually working with has many more repetitions, namely for every country, year, and category, there are many repeated rows corresponding to various sources of money, and I want to sum these all together. However, for now it's enough just to have one row for each country, year, and category, and just trivially apply the sum() function on each row. This will still exhibit the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into factors but that didn't change the behavior. I read something on the na.action argument but neither na.action=NULL nor na.action=na.skip changed the behavior. I thought about trying to turn all the NAs to 0s, and I can't think of what that would hurt but it feels like a hack that might bite me later on--not sure. But if I try to do it, I'm not sure how I would. When I wrote a function with the is.na() function in it, it didn't apply the if (is.na(x)) test in a vectorized way and gave the error that it would just use the first element of the vector. I thought about perhaps using lapply() on the column and coercing it back to a vector and sticking that in the column, but that also sounds kind of hacky and needlessly round-about.
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
Since your example already loads dplyr, you can use dplyr::summarise. Unlike aggregate, group_by followed by summarise keeps groups whose grouping values are NA.
library(dplyr)
df %>% group_by(Country,Year,Category,Population) %>%
summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
# Country Year Category Population Money
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A 1.00 0 NA 101
# 2 A 1.00 1.00 NA 100
# 3 A 2.00 0 0.482 101
# 4 A 2.00 1.00 0.482 101
# 5 A 3.00 0 0.600 101
# 6 A 3.00 1.00 0.600 101
# 7 B 1.00 0 0.494 101
# 8 B 1.00 1.00 0.494 101
# 9 B 2.00 0 0.186 100
# 10 B 2.00 1.00 0.186 100
# 11 B 3.00 0 0.827 101
# 12 B 3.00 1.00 0.827 101
# 13 C 1.00 0 0.668 100
# 14 C 1.00 1.00 0.668 101
# 15 C 2.00 0 0.794 100
# 16 C 2.00 1.00 0.794 100
# 17 C 3.00 0 0.108 100
# 18 C 3.00 1.00 0.108 100
Note: The OP's sample data doesn't have multiple rows for same groups. Hence, number of summarized rows will be same as actual rows.
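If staying in base R is preferred, one possible workaround is to leave the NA-containing column out of the grouping and merge it back afterwards. This is only a sketch on smaller made-up data, and it assumes Population is constant within each Country and Year, as in the question's example. The names dropped, sums, pops and out are illustrative.

```r
# Smaller made-up data with the same shape as the question's example
df <- data.frame(
  Country    = rep(c("A", "B"), each = 4),
  Year       = rep(c(1, 2), each = 2, times = 2),
  Category   = rep(c(0, 1), times = 4),
  Population = rep(c(NA, 0.5, 0.6, 0.7), each = 2),
  Money      = 101:108
)

# Grouping by Population loses the rows where it is NA
dropped <- aggregate(Money ~ Country + Year + Category + Population, df, sum)
nrow(dropped)   # 6

# Group only by the complete columns, then attach Population again
sums <- aggregate(Money ~ Country + Year + Category, data = df, FUN = sum)
pops <- unique(df[c("Country", "Year", "Population")])
out  <- merge(sums, pops, by = c("Country", "Year"))
nrow(out)       # 8
```

If Population could differ within a Country and Year, the merge would duplicate rows, so the assumption is worth checking on the real data first.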
