Loop within Loop in R - r

I am trying to figure out how to run two different loops on the same code. I am trying to create a matrix where I am filling a column with the mean of a variable for each year.
Here's the code I am using to do it right now:
matplot2 = as.data.frame(matrix(NA, nrow=16, ncol=4))
matplot2[1,1] = mean(matplot[matplot$Year==2003, 'TotalTime'])
matplot2[2,1] = mean(matplot[matplot$Year==2004, 'TotalTime'])
matplot2[3,1] = mean(matplot[matplot$Year==2005, 'TotalTime'])
matplot2[4,1] = mean(matplot[matplot$Year==2006, 'TotalTime'])
matplot2[5,1] = mean(matplot[matplot$Year==2007, 'TotalTime'])
matplot2[6,1] = mean(matplot[matplot$Year==2008, 'TotalTime'])
matplot2[7,1] = mean(matplot[matplot$Year==2009, 'TotalTime'])
matplot2[8,1] = mean(matplot[matplot$Year==2010, 'TotalTime'])
matplot2[9,1] = mean(matplot[matplot$Year==2011, 'TotalTime'])
matplot2[10,1] = mean(matplot[matplot$Year==2012, 'TotalTime'])
matplot2[11,1] = mean(matplot[matplot$Year==2013, 'TotalTime'])
matplot2[12,1] = mean(matplot[matplot$Year==2014, 'TotalTime'])
matplot2[13,1] = mean(matplot[matplot$Year==2015, 'TotalTime'])
matplot2[14,1] = mean(matplot[matplot$Year==2016, 'TotalTime'])
matplot2[15,1] = mean(matplot[matplot$Year==2017, 'TotalTime'])
matplot2[16,1] = mean(matplot[matplot$Year==2018, 'TotalTime'])
If it were just the year changing, I would write the loop like this:
for(i in 2003:2018) {
matplot2[1,1] = mean(matplot[matplot$Year==i, 'TotalTime'])
}
But, I need the row number in the matrix I'm printing the results into to change as well. How can I write a loop where I am printing the results of all these means into one column of a matrix?
In other words, I need to be able to have it loop matplot2[j,1] in addition to the matplot$Year==i.
Any suggestions would be greatly appreciated!

Your literal calculations of the mean(TotalTime) can all be reduced to a single command (with no for loop required):
matplot2 <- aggregate(TotalTime ~ Year, data = matplot, FUN = mean)
That should return a two-column frame with the unique values of Year in the first column, and the respective means in the second column.
Demonstrated with data I have:
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
res <- aggregate(disp ~ cyl, data = mtcars, FUN = mean)
res
# cyl disp
# 1 4 105.1364
# 2 6 183.3143
# 3 8 353.1000
This and more can be seen in summarize by group (of which this question is essentially a dupe, even if you didn't know to ask it that way).

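R is a vectorized language, but mean() collapses its input to a single number, so assigning to matplot2[1:16, 1] in one call would not compute one mean per year. Instead, loop over the row index and derive the year from it (row i holds year 2002 + i):
for (i in 1:16) {
  matplot2[i, 1] = mean(matplot[matplot$Year == (2002 + i), 'TotalTime'])
}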
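To make the "row index and year from the same counter" idea concrete, here is a minimal sketch on invented toy data (the values in this matplot are made up, and years is a helper name not present in the question). It fills one row per year and then checks the result against aggregate().

```r
# Toy stand-in for the asker's data; the values are invented for illustration
matplot <- data.frame(
  Year      = rep(2003:2005, each = 2),
  TotalTime = c(10, 20, 30, 50, 5, 15)
)

years <- sort(unique(matplot$Year))
matplot2 <- as.data.frame(matrix(NA, nrow = length(years), ncol = 4))

# j is the row to fill, years[j] is the year whose mean goes there
for (j in seq_along(years)) {
  matplot2[j, 1] <- mean(matplot[matplot$Year == years[j], "TotalTime"])
}

# The same means without an explicit loop
by_year <- aggregate(TotalTime ~ Year, data = matplot, FUN = mean)
```

Looping over seq_along(years) rather than a hard-coded 1:16 means the code keeps working if the range of years changes.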

Related

Recoding turns everything into the same value in R

I'm practicing R and I created a new column that had continuous numbers in them called ROI, and wanted to recode the number values into string values in R like this:
df = mutate(diabetes_df, ROI = ifelse(ROI < 18.5, 'Under', ROI))
df = mutate(diabetes_df, ROI = ifelse(ROI >= 18.5 & ROI <= 25, 'average', ROI))
diabetes_df = mutate(diabetes_df, ROI = ifelse(ROI > 25 & BMI <= 30, 'above average', ROI))
This works as expected and shows the labels wherever the condition is met. However, when I add the last ifelse statement:
df = mutate(diabetes_df, ROI = ifelse(ROI > 30, 'OVER', ROI))
It turns every value in the new column I made into the OVER value. I was wondering if anyone knew how to make it so that it would only say OVER for where the condition is met?
If ROI is a numeric column, the issue is that you are overwriting a numeric column with text values.
If ROI is not a numeric column, then inequality comparison on text strings works differently from how you have assumed.
Note that all your commands take the form: df = mutate(df, ROI = ifelse(ROI <condition>, 'label', ROI)). This means you are overwriting the original ROI values, and the replaced values will be used for subsequent comparisons.
Suppose df had only one row with ROI = 10, then:
# df:
# ROI = 10
df2 = mutate(df, ROI = ifelse(ROI < 18.5, 'Under', ROI))
# compares 10 < 18.5
# replaces 10 with 'Under'
# df2:
# ROI = 'Under'
df3 = mutate(df2, ROI = ifelse(ROI > 30, 'OVER', ROI))
# compares 'Under' > 30
# After standardizing formats, compares 'Under' > '30' (conversion to string)
# replaces 'Under' with 'OVER'
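The same chain of events can be reproduced with plain vectors. This is only a sketch with two made-up values (roi, step1 and step2 are illustrative names), but it shows both the silent conversion to character and the string comparison that follows.

```r
roi <- c(10, 40)

# ifelse() returns a character vector as soon as a label is substituted,
# so the untouched number 40 becomes the string "40"
step1 <- ifelse(roi < 18.5, "Under", roi)

# The number 30 is converted to "30" and the comparison is done on strings
step2 <- ifelse(step1 > 30, "OVER", step1)
```

After the second step both elements are "OVER", even though only one of the original numbers was above 30.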
Two possible solutions:
Write to a different column (this is good practice):
df %>%
mutate(ROI_label = NA) %>%
mutate(ROI_label = ifelse(ROI < 18.5, 'Under', ROI_label)) %>%
mutate(ROI_label = ifelse(ROI >= 18.5 & ROI <= 25, 'average', ROI_label)) %>%
mutate(ROI_label = ifelse(ROI > 25 & BMI <= 30, 'above average', ROI_label)) %>%
mutate(ROI_label = ifelse(ROI > 30, 'OVER', ROI_label))
Use case_when (this is also good practice):
df %>%
mutate(ROI = case_when(ROI < 18.5 ~ 'Under',
ROI >= 18.5 & ROI <= 25 ~ 'average',
ROI > 25 & BMI <= 30 ~ 'above average',
ROI > 30 ~ 'OVER'))
Even better, write to a different column and use case_when.
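A minimal sketch of that combination, assuming dplyr is installed. The data frame and its values are invented, and the conditions use ROI only (the BMI column from the question is left out here). case_when() evaluates the conditions in order and keeps the first match, so each branch only needs its upper bound.

```r
library(dplyr)

# Toy data; the values are made up for illustration
df <- data.frame(ROI = c(17, 22, 28, 35))

df <- df %>%
  mutate(ROI_label = case_when(
    ROI < 18.5 ~ "Under",
    ROI <= 25  ~ "average",
    ROI <= 30  ~ "above average",
    TRUE       ~ "OVER"
  ))
```

The numeric ROI column is left untouched, so later comparisons still operate on numbers.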
We can replicate the problem with the mtcars data frame. In the following code, the third mutate() statement sets wt to High in every row, because after the first mutate() the wt column is a vector of character values.
library(dplyr)
data(mtcars)
mtcars <- mutate(mtcars,wt = ifelse(wt < 2.6,"Low", wt))
# at this point, wt is character
str(mtcars$wt)
chr [1:32] "2.62" "2.875" "Low" "3.215" "3.44" "3.46" "3.57" "3.19" "3.15" ...
By the third mutate(), every row satisfies the ifelse() condition, because the comparison is done on character strings and the string values Low and Medium compare as greater than "3.61" (the number 3.61 is converted to a string first).
mtcars <- mutate(mtcars, wt = ifelse( 2.6 <= wt & wt <= 3.61,"Medium",wt))
mtcars <- mutate(mtcars, wt = ifelse( wt > 3.61,"High",wt))
...and the output:
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 High 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 High 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 High 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 High 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 High 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 High 20.22 1 0 3 1
We can prevent this behavior by using case_when(), which makes all of the comparisons to the numeric version of wt in a single pass of the data.
# use case_when()
data(mtcars)
mtcars %>% mutate(wt = case_when(
wt < 2.6 ~ "Low",
wt >= 2.6 & wt <= 3.61 ~ "Medium",
wt > 3.61 ~ "High"
)) %>% head(.)
...and the output:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 Medium 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 Medium 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 Low 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 Medium 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 Medium 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 Medium 20.22 1 0 3 1
From the comments to this answer, it wasn't clear to the OP how to save the changed column to the existing data frame. The following code snippet addresses that question.
data(mtcars)
mtcars %>% mutate(wt = case_when(
wt < 2.6 ~ "Low",
wt >= 2.6 & wt <= 3.61 ~ "Medium",
wt > 3.61 ~ "High"
)) -> mtcars

Use a character vector in the `by` argument

Within the data.table package in R, is there a way to pass a character vector to the by argument of a calculation?
Here is an example of what would be the desired output from this using mtcars:
mtcars <- data.table(mtcars)
ColSelect <- 'cyl' # One Column Option
mtcars[,.( AveMpg = mean(mpg)), by = .(ColSelect)] # Doesn't work
# Desired Output
cyl AveMpg
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
I know that it is possible to assign column names in j from a character vector by wrapping the variable in parentheses.
ColSelect <- 'AveMpg' # Column to be assigned for average mpg value
mtcars[,(ColSelect):= mean(mpg), by = .(cyl)]
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb AveMpg
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 19.74286
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 19.74286
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.66364
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 19.74286
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 15.10000
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 19.74286
Is there a suggestion as to what to put in the by argument in order to achieve this?
From ?data.table in the by section it says that by accepts:
a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., DT[, sum(a), by="x,y,z"]
a character vector of column names: e.g., DT[, sum(a), by=c("x", "y")]
So yes, you can use the answer in @cccmir's response. You can also use c() as @akrun mentioned, but that seems slightly extraneous unless you want multiple columns.
The reason you cannot use .() syntax is that in data.table .() is an alias for list(). And according to the same help for by the list() syntax requires an expression of column names - not a character string.
Going off the examples in the by help if you wanted to use multiple variables and pass the names as characters you could do:
mtcars[,.( AveMpg = mean(mpg)), by = "cyl,am"]
mtcars[,.( AveMpg = mean(mpg)), by = c("cyl","am")]
Try using it like this:
mtcars <- data.table(mtcars)
ColSelect <- 'cyl' # One Column Option
mtcars[, AveMpg := mean(mpg), by = ColSelect] # Should work
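To get the summary table from the question (rather than a new column added with :=), the character variable can be passed to by while j still uses .(). This is a sketch assuming data.table is installed; dt, GroupCols, res_one and res_two are illustrative names.

```r
library(data.table)

dt <- as.data.table(mtcars)

# One grouping column, held in a character variable
ColSelect <- "cyl"
res_one <- dt[, .(AveMpg = mean(mpg)), by = ColSelect]

# Several grouping columns work the same way with a longer vector
GroupCols <- c("cyl", "am")
res_two <- dt[, .(AveMpg = mean(mpg)), by = GroupCols]
```

Working on a copy (dt) also avoids overwriting the built-in mtcars data frame.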

write.fwf column names don't line up with values

The following code produces a table whose column names don't line up with its values:
library( gdata )
test0 <- matrix(5:28, nrow = 4)
row.names(test0) <- paste("r", 1:4, sep = "")
colnames(test0) <- paste("c", 1:6, sep = "")
test0[3, 2] <- 1234567890
test0[ , 3] <- 0.19412341293479123840214
test0 <- format(test0, digits = 5, trim = T, width = 10, scientific = T)
write.fwf(test0, file = paste("test", ".txt", sep = ""), width = 11, rowname = T, colname = T, quote = F)
How can I make the column names line up with each column's values (so that the table is readable by GAMS)?
Curious that the column names are not treated in the same way. A work-around could be to add the column names as a row in the table and then write the table without column names...
i.e.
test1 <- rbind( colnames(test0) , test0 )
write.fwf(test1, file = paste("test", ".txt", sep = ""),
width = 11,
rownames = T,
colnames = F, #Don't print the column names
quote = F )
This looks like:
I created a small script for saving a dataframe to fwf format with columns aligned with values.
See the latest gist code
suppressPackageStartupMessages({
library(gdata)
library(stringr)
})
#' Automatically generate a .fwf file (fixed-width file) in R
#' @description This function automatically creates a fixed-width file.
#' It aligns the column headers with the data.
#' @param df dataframe
#' @param filename filename
#' @param nbspaces number of spaces used as the column separator
#' @param replace_na replacement string for empty/NA values
#' @param rowname if defined, the row names are converted to a column with this name
#' @examples write_fwf(mtcars, "carname", "/tmp/mtcars.fwf")
#'
#' # colnames: carname,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
#' # cols: 22,7,6,8,6,7,8,8,5,5,7,7
#' carname mpg cyl disp hp drat wt qsec vs am gear carb
#' Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#' Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#' Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#' Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
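write_fwf <- function(df, rowname = NULL, filename, nbspaces = 3, replace_na = "NA") {
# rowname comes second so that the call write_fwf(mtcars, "carname", "/tmp/mtcars.fwf") works
# Convert rownames to a column named `rowname`, if a name was given
if (!is.null(rowname)) {
df <- tibble::rownames_to_column(df, rowname)
}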
# Convert all columns to character
tmpdf = data.frame(df)
tmpdf[] <- lapply(df, as.character)
# Compute column size
nasize=nchar(replace_na)
maxwidthname <- nchar(colnames(tmpdf))
maxwidthvalue <- sapply(tmpdf, function(x) max(nchar(x)))
maxcols <- pmax(maxwidthname,maxwidthvalue,nasize)
delta <- maxwidthvalue - maxwidthname
# Compute header
header <- c()
for (idx in seq(ncol(df))) {
if (is.character(df[,idx])) {
header <- append(header,paste0(colnames(df)[idx],strrep(" ",max(delta[idx],0))))
} else {
header <- append(header,paste0(strrep(" ",max(delta[idx],0)), colnames(df)[idx]))
}
}
# Open file
file <- file(filename, "w")
# Write header
writeLines(paste("# colnames:", paste(colnames(df), collapse=',')),file)
writeLines(paste("# cols:", paste(unlist(maxcols+nbspaces), collapse=',')),file)
writeLines(header,file, sep=strrep(" ",nbspaces))
writeLines("", file, sep="\n")
close(file)
# Export data
write.fwf(
df,
file=filename,
append=TRUE,
width=maxcols,
colnames=FALSE,
na=replace_na,
sep=strrep(" ",nbspaces),
justify="left"
)
}
#' Automatically read a .fwf file (fixed-width file) in R
#' @description This function reads a fixed-width file and detects its layout automatically
#' @param maxsearchlines number of lines to search for the column metadata description
#' @examples read_fwf("/tmp/mtcars.fwf")
read_fwf <- function(filename,maxsearchlines=100) {
# Search columns informations
file <- file(filename, "r")
on.exit(close(file))
lines <- readLines(file,n=maxsearchlines)
idxname <- str_which(lines,"# colnames: ")
colnames <- str_replace(lines[idxname], "# colnames: ", "")
colnames <- unlist(str_split(colnames, ","))
idxcols <- str_which(lines,"# cols: ")
colwidths <- str_replace(lines[idxcols], "# cols: ", "")
colwidths <- str_split(colwidths, ",")
colwidths <- strtoi(unlist(colwidths))
return(read.fwf(file=filename, skip=idxcols+1, col.names = colnames, widths=colwidths,strip.white=TRUE))
}
Sample usage
write_fwf(mtcars, "carname", "/tmp/mtcars.fwf")
The result
# colnames: carname,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
# cols: 22,7,6,8,6,7,8,8,5,5,7,7
carname mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
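The header-as-a-row workaround from the first answer can also be sketched in base R alone. Note that this swaps write.fwf for format() plus writeLines(), so it is an illustration of the idea, not the gdata behaviour. The names tab, padded and lines are made up, and the width of 11 is taken from the question.

```r
test0 <- matrix(5:28, nrow = 4,
                dimnames = list(paste0("r", 1:4), paste0("c", 1:6)))

# Treat the header as an ordinary first row and the row names as a first column
tab <- rbind(c("", colnames(test0)),
             cbind(rownames(test0), format(test0, trim = TRUE)))

# Pad every cell, header included, to the same width
padded <- format(tab, width = 11, justify = "right")

# One string per table row; writeLines(lines, "test.txt") would save them
lines <- apply(padded, 1, paste, collapse = "")
writeLines(lines)
```

Because the header cells go through the same padding as the values, every line ends up with the same length.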

Finding duplicate columns in a data.table

I have a pretty big data.table (500 x 2000), and I need to find out if any of the columns are duplicates, i.e., have the same values for all rows. Is there a way to efficiently do this within the data.table structure?
I have tried a naive two loop approach with all(col1 == col2) for each pair of columns, but it takes too long. I have also tried converting it to a data.frame and using the above approach, and it still takes quite a long time.
My current solution is to convert the data.table to a matrix and use the apply() function as:
similarity.matrix <- apply(m, 2, function(x) colSums(x == m)) / nrow(m)
However, the approach forces the modes of all elements to be the same, and I'd rather not have that happen. What other options do I have?
Here is a sample construction for the data.table:
m = matrix(sample(1:10, size=1000000, replace=TRUE), nrow=500, ncol=2000)
DF = as.data.frame(m)
DT = as.data.table(m)
Following the suggestion of @Haboryme*, you can do this using duplicated to find any duplicated vectors. duplicated usually works row-wise, but you can transpose the data with t() just for finding the duplicates.
DF <- DF[ , which( !duplicated( t( DF ) ) ) ]
With a data.table, you may need to add with = FALSE (I think this depends on the version of data.table you're using).
DT <- DT[ , which( !duplicated( t( DT ) ) ), with = FALSE ]
*@Haboryme, if you were going to turn your comment into an answer, please do and I'll remove this one.
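A small sketch of the transpose-then-duplicated idea on a made-up three-column data frame (the column names are invented). Keep in mind that t() converts the data frame to a matrix, so the columns are compared after conversion to a common type.

```r
# Toy data: a_copy repeats column a
DF <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), a_copy = c(1, 2, 3))

# After t(), each original column is a row, which is what duplicated() compares
is_dup <- duplicated(t(DF))

# drop = FALSE keeps a data frame even if only one column survives
DF_unique <- DF[, !is_dup, drop = FALSE]
```

Only the later copy is flagged, so the first occurrence of each distinct column is kept.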
Here's a different approach, where you hash each column first and then call duplicated.
library(digest)
dups <- duplicated(sapply(DF, digest))
DF <- DF[,which(!dups)]
Depending on your data this might be a faster way.
I am using mtcars for a reproducible result:
library(data.table)
library(digest)
# Create data
data <- as.data.table(mtcars)
data[, car.name := rownames(mtcars)]
data[, car.name.dup := car.name] # create a duplicated column
data[, car.name.not.dup := car.name] # create a second duplicated column...
data[1, car.name.not.dup := "Moon walker"] # ... but change a value so that it is no longer a duplicated column
data contains now:
> head(data)
mpg cyl disp hp drat wt qsec vs am gear carb car.name car.name.dup car.name.not.dup
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Mazda RX4 Moon walker
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag Mazda RX4 Wag Mazda RX4 Wag
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 Datsun 710 Datsun 710
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive Hornet 4 Drive Hornet 4 Drive
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout Hornet Sportabout Hornet Sportabout
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant Valiant Valiant
Now find the duplicated columns:
# create a vector with the checksum for each column (and keep the column names as row names)
col.checksums <- sapply(data, function(x) digest(x, "md5"), USE.NAMES = T)
# make a data table with one row per column name and hash value
dup.cols <- data.table(col.name = names(col.checksums), hash.value = col.checksums)
# self join using the hash values and filter out all column name pairs that were joined to themselves
dup.cols[dup.cols,, on = "hash.value"][col.name != i.col.name,]
Results in:
col.name hash.value i.col.name
1: car.name.dup 58fed3da6bbae3976b5a0fd97840591d car.name
2: car.name 58fed3da6bbae3976b5a0fd97840591d car.name.dup
Note: The result still contains both directions (col1 == col2 and col2 == col1) and should be deduplicated ;-)
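One possible way to do that deduplication, sketched on a hand-built copy of the result table above (pairs and unique_pairs are illustrative names; data.table is assumed to be installed): keep a pair only when its two column names are in alphabetical order.

```r
library(data.table)

# The self-join result shown above, rebuilt by hand for illustration
pairs <- data.table(
  col.name   = c("car.name.dup", "car.name"),
  hash.value = "58fed3da6bbae3976b5a0fd97840591d",
  i.col.name = c("car.name", "car.name.dup")
)

# Each unordered pair appears twice; the ordering condition keeps one of them
unique_pairs <- pairs[col.name < i.col.name]
```

The same filter could replace the col.name != i.col.name condition in the self-join, since it also drops pairs joined to themselves.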

Re-assembling a dataframe after a split [duplicate]

I have trouble applying a split to a data.frame and then assembling some aggregated results back into a different data.frame. I tried using the 'unsplit' function but I can't figure out how to use it properly to get the desired result. Let me demonstrate on the common 'mtcars' data: Let's say that my ultimate result is to get a data frame with two variables: cyl (cylinders) and mean_mpg (mean over mpg for group of cars sharing the same count of cylinders).
So the initial split goes like this:
spl <- split(mtcars, mtcars$cyl)
The result of which looks something like this:
$`4`
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
...
$`6`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
...
$`8`
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
...
Now I want to do something along the lines of:
df <- as.data.frame(lapply(spl, function(x) mean(x$mpg)), col.names=c("cyl", "mean_mpg"))
However, doing the above results in:
X4 X6 X8
1 26.66364 19.74286 15.1
While I'd want the df to be like this:
cyl mean_mpg
1 4 26.66364
2 6 19.74286
3 8 15.10000
Thanks, J.
If you are only interested in reassembling a split then look at (2), (4) and (4a) but if the actual underlying question is really about the way to perform aggregations over groups then they all may be of interest:
1) aggregate Normally one uses aggregate as already mentioned in the comments. Simplifying @alistaire's code slightly:
aggregate(mpg ~ cyl, mtcars, mean)
2) split/lapply/do.call Also @rawr has given a split/lapply/do.call solution in the comments which we can also simplify slightly:
spl <- split(mtcars, mtcars$cyl)
do.call("rbind", lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
3) do.call/by The last one could alternately be rewritten in terms of by:
do.call("rbind", by(mtcars, mtcars$cyl, with, data.frame(cyl = cyl[1], mpg = mean(mpg))))
4) split/lapply/unsplit Another possibility is to use split and unsplit:
spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(cyl = cyl[1], mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, "[[", "cyl"))
4a) or if row names are sufficient:
spl <- split(mtcars, mtcars$cyl)
L <- lapply(spl, with, data.frame(mpg = mean(mpg), row.names = cyl[1]))
unsplit(L, sapply(L, rownames))
The above do not use any packages but there are also many packages that can do aggregations including dplyr, data.table and sqldf:
5) dplyr
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarize(mpg = mean(mpg)) %>%
ungroup()
6) data.table
library(data.table)
as.data.table(mtcars)[, list(mpg = mean(mpg)), by = "cyl"]
7) sqldf
library(sqldf)
sqldf("select cyl, avg(mpg) mpg from mtcars group by cyl")
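For comparison with the attempt in the question, here is a minimal base R sketch that stays close to it: compute the per-group means with sapply() and build the two-column data frame from the resulting named vector. The name means is illustrative, and the group labels are converted back to numbers on the assumption that cyl is numeric, as it is in mtcars.

```r
spl <- split(mtcars, mtcars$cyl)

# Named numeric vector: the names are the group labels "4", "6", "8"
means <- sapply(spl, function(x) mean(x$mpg))

df <- data.frame(
  cyl      = as.numeric(names(means)),
  mean_mpg = unname(means)
)
```

The original attempt produced one column per group because as.data.frame() treats each list element as a column; taking the labels from names() turns them into a proper cyl column instead.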

Resources