R: subtract 1 from a specific column, based on the value of another column

I'm trying to do a basic operation (let's say, subtract 1) on some column, referenced by the value of another column (let's say the first column). For example: for a given row, the value in the first column is equal to 5. Then I want to subtract 1 from column 5 of that same row.
Data looks like
tmp <- data.frame(replicate(5,sample(2:5,5,rep=TRUE)))
Of course, I can achieve this with multiple lines of code, each time selecting the subset of rows that satisfies a certain condition, but I'm sure this operation could be performed more cleanly and dynamically.
Additional question: is there an easy way to refer to columns by name, rather than by index, in the same manner? For example the column name "S5"? The easiest way I can see is to match the name against names(tmp) and use the resulting index, or can you think of an easier way?
Any suggestions?

So the column index of the element is in the first column of the row:
for (i in seq_len(nrow(tmp))) tmp[i, tmp[i, 1]] <- tmp[i, tmp[i, 1]] - 1
It also works if the first column holds the column names as character strings (not as factors!):
tmp <- data.frame(cInd = c("A", "B", "C"), A = 1:3, B = 11:13, C = 21:23,
                  stringsAsFactors = FALSE)
for (i in seq_len(nrow(tmp))) tmp[i, tmp[i, 1]] <- tmp[i, tmp[i, 1]] - 1
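A loop-free variant is also possible. The sketch below is not part of the answer above; it assumes all value columns are numeric and uses made-up toy data. It converts the data to a matrix and indexes it with a two-column (row, column) matrix, so each row's target cell is decremented in one step.

```r
# Toy data (hypothetical): column 1 holds the index of the column to decrement.
tmp <- data.frame(X1 = c(2, 5, 3),
                  X2 = c(4, 4, 4),
                  X3 = c(3, 3, 3),
                  X4 = c(5, 5, 5),
                  X5 = c(2, 2, 2))

m <- as.matrix(tmp)                      # numeric matrix with the same column names
idx <- cbind(seq_len(nrow(m)), m[, 1])   # one (row, column) pair per row
m[idx] <- m[idx] - 1                     # decrement exactly those cells
tmp_vec <- as.data.frame(m)

# Same idea when column 1 holds column names instead of positions.
tmp2 <- data.frame(cInd = c("A", "B", "C"), A = 1:3, B = 11:13, C = 21:23,
                   stringsAsFactors = FALSE)
vals <- as.matrix(tmp2[-1])              # numeric part only
idx2 <- cbind(seq_len(nrow(vals)), match(tmp2$cInd, colnames(vals)))
vals[idx2] <- vals[idx2] - 1
tmp2[-1] <- as.data.frame(vals)          # write the numeric columns back
```

Here match() turns each name in cInd into a column position, which also covers the "additional question" about referencing columns by name.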

Related

Pasting the first row to the column name within a list

I have 68 data files- all with the same identifiers-but with different indicators. I converted these individual files into a list with each data frame as a separate element.
The first row of every data frame is a year, which I would like to paste to the column name. I want to be able to separate it by "_".
For example, right now the column name is Arbeitslose, and the row under it has 2018. I would like the column name to become Arbeitslose_2018.
I know how to do this on a single data frame. The code I used is below.
RAW_2[1,] <- as.character(RAW_2[1,]) # Convert the first row to character.
colnames(RAW_2) <- paste(colnames(RAW_2),RAW_2[1, ], sep = "_") # Paste the column name and the year (first row)
RAW_2 <- RAW_2[rownames(RAW_2) != 1, ] # Drop the first row (the years), which is now redundant
but I don't know how to do this for a list.
I cannot merge the data frames into a single one, because the column names are not unique. I would need to do this step for me to be able to merge it into a data set and proceed. I'm forced to work with lists, something I am horrible with.
Is there an easy way to do this? I am quite lost on how to proceed.
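You can use lapply():
rename_col <- function(x) {
  # Read the first row as character, one value per column
  years <- vapply(x[1, ], as.character, character(1))
  # paste() accepts sep; paste0() would append "_" as one more string instead
  colnames(x) <- paste(colnames(x), years, sep = "_")
  x[-1, ]  # drop the year row
}
# df_list is your list of data.frames
df_list <- lapply(df_list, rename_col)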
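To see the effect on concrete input, here is a small worked example. The two data frames, the column name Einwohner, and the function name add_year_suffix are made up for illustration; only Arbeitslose and the year-in-first-row layout come from the question.

```r
# Hypothetical stand-in for the real list of 68 data frames.
df_list <- list(
  data.frame(Arbeitslose = c("2018", "10", "20"),
             Einwohner   = c("2018", "100", "200"),
             stringsAsFactors = FALSE),
  data.frame(Arbeitslose = c("2019", "11", "21"),
             Einwohner   = c("2019", "101", "201"),
             stringsAsFactors = FALSE)
)

add_year_suffix <- function(x) {
  years <- vapply(x[1, ], as.character, character(1))  # first row, one value per column
  colnames(x) <- paste(colnames(x), years, sep = "_")
  x[-1, , drop = FALSE]                                # keep a data.frame even with one column
}

renamed <- lapply(df_list, add_year_suffix)
names(renamed[[1]])  # "Arbeitslose_2018" "Einwohner_2018"
names(renamed[[2]])  # "Arbeitslose_2019" "Einwohner_2019"
```

Once the names carry the year, they no longer clash across list elements.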

Write Loop To Perform Function through Column Names

I have a dataset with a quantitative column for which I want to calculate the mean based on groups. The other columns in the dataset are titled [FY2001,FY2002,...,FY2018]. These columns are populated with either a 1 or 0.
I want to calculate the mean of the first column for each of the FY columns when they equal 1. So, I want 18 different means.
I am used to using macros in SAS where I can replace parts of a dataset name or column name using a let statement. This is my attempt at writing a loop in R to solve this problem:
vector = c("01","02","03","04","05","06","07","08","09","10",
"11","12","13","14","15","16","17","18")
varlist = paste("FY20", vector, sep = "")
abc = for (i in length(varlist)){
table(ALL_FY2$paste(varlist)[i])
}
abc
This doesn't work since it treats the paste function as a column. What am I missing? Any help would be appreciated.
We can use [[ instead of $ to subset the column. In addition, 'abc' should be a list, with each element assigned the table output for the corresponding column inside the for loop.
abc <- vector("list", length(varlist)) # initialize a `list` object
Loop over the sequence of 'varlist' (seq_along(varlist)) and not over length(varlist), which is a single number, so the loop body would run only once
for(i in seq_along(varlist)) abc[[i]] <- table(ALL_FY2[[varlist[i]]])
However, if we need to have a single table output from all the columns mentioned in the 'varlist', unlist the columns to a vector and replicate the sequence of columns before applying the table
ind <- rep(seq_along(varlist), each = nrow(ALL_FY2))
table(ind, unlist(ALL_FY2[varlist]))
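The question itself asks for group means rather than tables. The same [[ subsetting can be used for that; the following is a minimal sketch on made-up data, assuming the quantitative variable is the first column (here given the hypothetical name amount) and the FY columns contain only 0 and 1.

```r
# Toy stand-in for ALL_FY2 (hypothetical values), with two FY columns only.
ALL_FY2 <- data.frame(amount = c(10, 20, 30, 40),
                      FY2001 = c(1, 0, 1, 0),
                      FY2002 = c(0, 1, 1, 1))

# Build the column names; use 1:18 for the full FY2001..FY2018 range.
varlist <- sprintf("FY20%02d", 1:2)

# For each FY column, average the first column over the rows where the flag is 1.
fy_means <- sapply(varlist, function(v) mean(ALL_FY2[[1]][ALL_FY2[[v]] == 1]))
fy_means
```

sapply() returns a named numeric vector here, one mean per FY column, so no pre-allocated list is needed.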

R: strsplit in each column; Error: replacement element 1 has [y] rows to replace 1 rows

I'm working with a data frame (I'll call it 'letters') in R that has 15 rows and 2 columns. Each entry in column 2 contains a character string like "A|B|C|D|E". I want to split the string at each place a | appears to get the vector c("A", "B", "C", "D", "E"). Here's my best idea of how to do this:
for(i in 1:nrow(letters)){
letters[i,2] <- strsplit(letters[i,2], split = "[|]")
}
I get a similar error as discussed here ("replacement has [x] rows, data has [y]"), and it seems to be trying to make a separate column for each index of the output vector. I'm sure this is a simple question, but I am new to R and stuck.
Is strsplit(letters[i,2], split = "[|]")[[1]] what you're looking for? You won't be able to put that vector back in letters[i,2], though, as it has length 5 (instead of 1).
Your second column is (I think) a character vector. strsplit, as it mentions in the documentation (?strsplit) returns a list. Before we get into why your specific situation happened, some general advice:
Make a new column instead of replacing an existing one. This has the added benefit of not losing the original values.
Only replace values in a column with new values of the same class (e.g., character for character, integer for integer).
So I suggest adding a new column of split values:
letters[["splits"]] <- strsplit(letters[[2]], split = "|", fixed = TRUE)
You now have a list column, and each row of this column has a vector of the split letters from the original values.
Why your problem happened
Let's dissect the assignment statement:
letters[i,2] <- strsplit(letters[i,2], split = "[|]")
On the left side of <- is letters[i, 2], which is a subset of a data.frame. A data.frame stores all of its data in a list. R allows us to use this fact, especially in assignment. We can add or replace columns just like adding or replacing items in a list.
# This...
letters[, "one"] <- 1
letters[, "two"] <- 2
# is effectively the same as this
letters[, c("one", "two")] <- list(1, 2)
To the right of <-, we have a call to strsplit(), which returns a list. As in the example just above, if you assign a list to a subset of a data.frame, it will be coerced into a data.frame itself. Each element of the list will be considered a column. So, the assignment plays out like this:
If letters[i,2] is "A|B|C|D|E", then strsplit(letters[i,2], split = "[|]") is list(c("A", "B", "C", "D", "E")).
The assignment checks both sides, and sees the data.frame as the "higher" type, so it coerces the list to a data.frame. The right side is now effectively data.frame(c("A", "B", "C", "D", "E")).
Now it tries to assign a data.frame with 1 column and 5 rows to a subset with 1 column and 1 row. Those dimensions don't match, so it takes what it can from the right side (just the first row) and warns you about what happened.
Why the suggested assignment works
So why isn't there any coercion in this?
letters[["splits"]] <- strsplit(letters[[2]], split = "|", fixed = TRUE)
The left side uses [[ subsetting (treating the data.frame like a list) to add or replace the "splits" column. So no coercion is ever done.
Also, a data.frame can have a list as a column, just like a list can have a list as an element. A data.frame column just has to satisfy two things:
It has to be a vector.
Its length must be equal to the number of rows in the data.frame (recycling's attempted if necessary).
A list is a type of vector. And strsplit() returns a list the same length as its input, so both criteria are met.
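A short illustration of the list column, using a made-up two-row data frame. The name letters_df and the column names are hypothetical; a different name is used because letters is also a built-in constant in R.

```r
# Toy data (hypothetical): pipe-separated codes in the second column.
letters_df <- data.frame(id = 1:2,
                         codes = c("A|B|C|D|E", "F|G"),
                         stringsAsFactors = FALSE)

# fixed = TRUE treats "|" literally instead of as a regex alternation.
letters_df[["splits"]] <- strsplit(letters_df[["codes"]], split = "|", fixed = TRUE)

is.list(letters_df[["splits"]])   # the new column is a list
lengths(letters_df[["splits"]])   # one length per row: 5 and 2
letters_df[["splits"]][[1]]       # character vector for the first row
```

Each row keeps its own vector, and the vectors do not need to have the same length.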

Randomly selecting dataframe column. Avoid sampling same column again

Is there a way to randomly pick a column in a dataframe and then avoid randomly picking it again? This should pick a random column
random_data_vector = data[, sample(ncol(data), 1)]
but I'm not sure how to avoid picking the column again. I thought about removing the column completely but there might be a better approach
You can first sample the columns with
random_cols <- sample(ncol(data))
and then select the random vectors like this
random_data_vector1 <- data[, random_cols[1]]
random_data_vector2 <- data[, random_cols[2]]
The default setting of sample is replace = FALSE, thus in the random_cols vector you won't have duplicated numbers and you won't select one column twice.
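If every column should eventually be visited once, the shuffled index vector can simply be iterated. A minimal sketch on made-up data (the column names and the seed are arbitrary):

```r
set.seed(1)  # only to make the example reproducible
data <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# A random permutation of the column indices: each index appears exactly once.
random_cols <- sample(ncol(data))

# Visit the columns in that random order, collecting each as a vector.
picked <- lapply(random_cols, function(k) data[[k]])
names(picked) <- names(data)[random_cols]
```

Because the order is fixed up front, there is no need to remove columns from the data frame or to track which ones have already been used.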

the use of minus sign inside square brackets

Below is an exercise from Datacamp.
Using the cbind() call to include all three sheets. Make sure the first column of urban_sheet2 and urban_sheet3 are removed, so you don't have duplicate columns. Store the result in urban.
Code:
# Add code to import data from all three sheets in urbanpop.xls
path <- "urbanpop.xls"
urban_sheet1 <- read.xls(path, sheet = 1, stringsAsFactors = FALSE)
urban_sheet2 <- read.xls(path, sheet = 2, stringsAsFactors = FALSE)
urban_sheet3 <- read.xls(path, sheet = 3, stringsAsFactors = FALSE)
# Extend the cbind() call to include urban_sheet3: urban
urban <- cbind(urban_sheet1, urban_sheet2[-1],urban_sheet3[-1])
# Remove all rows with NAs from urban: urban_clean
urban_clean<-na.omit(urban)
My question is why [-1] is used to remove the first column in cbind(). Is it a special use of square brackets inside cbind()? Does that mean that if I want to remove the first two columns the code should be urban_sheet2[-2]? I only know that square brackets are used for selecting certain columns or rows. This confuses me.
This is not specific to cbind(). You can use - inside square brackets to remove any particular row or column you want. If your data frame is df, df[,-1] will have its first column removed. df[,-2] will have its second (and only second) column removed. df[,-c(1,2)] will have both its first and second columns removed. Likewise, df[-1,] will have its first row removed, etc.
This cannot be done with column names, e.g., df[,-"var1"] will not work. To use column names, you can use which(), as in df[,-which(names(df) %in% "var1")], but simply df[,!(names(df) %in% "var1")] is easier and yields the same result. You can also use subset(): subset(df, select = -c(var1, var2)); this will remove the columns named "var1" and "var2".
Note that removing rows and columns only affects the output of the call, and will not affect the original object unless the output is assigned to the original object.
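The variants above can be compared side by side on a small made-up data frame. Note that df[-1], with no comma, is the list-style form used in the exercise and also drops the first column.

```r
df <- data.frame(var1 = 1:2, var2 = 3:4, var3 = 5:6)

names(df[-1])                                # "var2" "var3" (list-style, as in the exercise)
names(df[, -1])                              # "var2" "var3"
names(df[, -c(1, 2), drop = FALSE])          # "var3"; drop = FALSE keeps a data.frame
names(df[, !(names(df) %in% "var1")])        # "var2" "var3"
names(subset(df, select = -c(var1, var2)))   # "var3"

dropped_row <- df[-1, ]                      # first row removed
names(df)                                    # original is unchanged: "var1" "var2" "var3"
```

Without drop = FALSE, df[, -c(1, 2)] would return a plain vector, because only one column remains.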
