Variable dataframe in R

Say I have loaded a csv file into R with two columns (say column A and column B) with real-valued entries. Call the dataframe df. Is there a way of speeding up the following code:
dfm <- df[floor(A) = x & floor(B) = y,]
x <- 2
y <- 2
dfm
I am hoping there will be something akin to a function, e.g.
dfm <- function(x,y) {df[floor(A) = x & floor(B) = y,]}
so that I can type
Any help much appreciated.

The way that's written right now won't work for a few reasons:
You need to assign values to x and y before you assign dfm. In other words, the lines x <- 2 and y <- 2 must come before the dfm <- ... line.
R doesn't know what A and B are, even if you put them inside the brackets of the dataframe that contains them. You need to write df$A and df$B.
= is the assignment operator, but you're looking for the logical operator ==. Right now your code is saying "Assign the value x to floor(A)" (which doesn't really make sense). You want to tell it "Only choose rows where floor(A) equals x", or floor(A)==x.
So what you want is:
dfm.create <- function(x,y) {df[floor(df$A)==x & floor(df$B)==y,]}
dfm <- dfm.create(2,2)
Note that if you want the dataframe to be called dfm, you don't want to name the function dfm, or you will have to erase the function to make the dataframe.
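A minimal sketch of the corrected function on made-up data. The values in df are invented for illustration, and the data parameter is a hypothetical addition (not in the original answer) that avoids hard-coding the global df inside the function:

```r
# Made-up data standing in for the loaded CSV
df <- data.frame(A = c(2.1, 2.9, 3.5, 2.4),
                 B = c(2.7, 3.2, 2.0, 2.05))

# Keep the rows whose integer parts of A and B equal x and y.
# 'data' is a hypothetical parameter; it defaults to the global df.
dfm.create <- function(x, y, data = df) {
  data[floor(data$A) == x & floor(data$B) == y, ]
}

dfm <- dfm.create(2, 2)
dfm   # rows 1 and 4 of df
```

Calling dfm.create(3, 2) would pick out a different subset without retyping the filter.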

Related

In R: Can I assign a data frame (say x) of row numbers =1 as the column header for another data frame (say y)?

x is a data frame
print(x)
rownames
1 Height
2 Weight
3 Age
4 Eye-color
I have another data frame y with 2 rows of data to which I want to assign headers using data frame x.
print(y)
John 180 150 35 Brown
Smith 153 250 23 Black
Some suggested this:
x <- x[is.na(x) == F]
(this converted the dataframe x into a character vector. My 1st question is: Is this necessary to add column header?)
colnames(y) <- c(x)
My 2nd question is: if I left x as a data frame, would I still be able to add its values as headers onto data frame y?
Is one way better than the other?
The column names of a data frame are a character vector.
When you set the column names of a data frame, you must set it to either (a) a character vector or (b) something that R can easily convert to a character vector with as.character.
You have a data frame x that includes values, Height, Weight, ..., and you want those values to be the column names of your data frame y. As x is a data frame, not a character vector, and as.character(x) doesn't work great, you will need to extract the values from x that you want to turn into a character vector.
There are many ways to do this. If the column names are in a column of x called rownames (this seems to be the case from your example), you could use these methods:
colnames(y) = x$rownames
colnames(y) = x[, "rownames"]
colnames(y) = x[["rownames"]]
If the column names are in the first column of x, you could use these methods:
colnames(y) = x[, 1]
colnames(y) = x[[1]]
If the column names are in the only column of x, you could use this method:
colnames(y) = unlist(x)
There are slight differences between all of these methods, but which one is "best" depends on context. Generally, it is better to use column names than column numbers, as those tend to be more invariant, and help you catch bugs if the data changes. It also makes for clearer code. The methods with [ and [[ allow you to use variables, e.g., you can do this:
column_with_names = "rownames"
colnames(y) = x[, column_with_names]
But this would not work with $.
In all of the cases, you can combine the extraction of the data from x with the assignment of column names to y, or you could do it separately:
# all at once
colnames(y) = x$rownames
# separately
y_colnames = x$rownames
colnames(y) = y_colnames
Which one is best depends on (a) Do you need to use the original x later for something else? (b) Will you need to use the extracted column names again later? (c) Is the extraction code so long that it makes sense to put it on a separate line for readability?
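A small sketch of the extraction and assignment, under the assumption that John and Smith are the row names of y and that y has four data columns (the question's printout is ambiguous on this point):

```r
# x holds the header names in a column called "rownames"
x <- data.frame(rownames = c("Height", "Weight", "Age", "Eye-color"),
                stringsAsFactors = FALSE)

# y: two rows of data; John and Smith are assumed to be row names
y <- data.frame(c(180, 153), c(150, 250), c(35, 23), c("Brown", "Black"),
                row.names = c("John", "Smith"))

# Extract the character vector from x and use it as the header of y
colnames(y) <- x[["rownames"]]
y
```

Here x stays a data frame throughout; only the extracted column is handed to colnames.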
Some suggested this:
x <- x[is.na(x) == F]
This line of code removes missing values from x. Using == F is a bad habit; it would be better written as x[!is.na(x)]. Since the x you show does not have missing values, it doesn't seem relevant to your question.

How to sum the values of different columns in a dataframe looping on the variables names

I'm relatively new to R (used to work in Stata before) so sorry if the question is too trivial.
I've a dataframe with variables named in a sequential way that follows the following logic:
q12.X.Y
where X assumes the values from 1 to 9, and Y from 1 to 5
I need to add together the values of the variables of all the q12.X.Y variables with the Y numbers from 1 to 3 (but NOT those ending with the number 4 or 5)
Ideally I would have written a loop based on the sequential numbers of the variables, namely something like:
df$test <- 0
for(i in 1:9){
for(j in 1:3){
df$test <- df$test+ df$q12.i.j
}
}
That obviously does not work.
I also tried with the command "rowSums" and "subset"
df$test <- rowSums(subset(df,select= ...)
However, I find it a bit cumbersome, as the column numbers are not sequential and I do not want to type the names of all the variables.
Any suggestion how to do that?
We can use grep to get the match
rowSums(df[grep("q12\\.[1-9]\\.[1-3]", names(df))])
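or, if all the column names are present, use an exact match by creating the column names with paste0 (each = 3 repeats every "q12.X." prefix three times so that it is paired with the suffixes 1, 2 and 3; without each, the recycling produces only 9 distinct names, each repeated three times)
rowSums(df[paste0(rep(paste0("q12.", 1:9, "."), each = 3), 1:3)])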
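For comparison, a sketch on toy data (every q12.X.Y column filled with 1s, an assumption made purely so the result is easy to check). It also shows why the loop in the question fails: df$q12.i.j looks for a column literally named q12.i.j, whereas [[ ]] accepts a name built as a string:

```r
# Toy data: two rows of 1s for every q12.X.Y column (X in 1:9, Y in 1:5)
grid <- expand.grid(Y = 1:5, X = 1:9)
cols <- paste0("q12.", grid$X, ".", grid$Y)
df <- as.data.frame(matrix(1, nrow = 2, ncol = length(cols),
                           dimnames = list(NULL, cols)))

# Loop version: build each column name as a string and index with [[ ]]
df$test <- 0
for (i in 1:9) {
  for (j in 1:3) {
    df$test <- df$test + df[[paste0("q12.", i, ".", j)]]
  }
}

# Vectorised version; ^ and $ anchor the pattern to the whole name
keep <- grep("^q12\\.[1-9]\\.[1-3]$", names(df))
df$test2 <- rowSums(df[keep])
```

Both columns should agree, since each sums the same 27 variables per row.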

Data.frame of Data.frames

I'm using a data.frame that contains many data.frames. I'm trying to access these sub-data.frames within a loop. Within these loops, the names of the sub-data.frames are contained in a string variable. Since this is a string, I can use the [,] notation to extract data from these sub-data.frames, e.g. X <- "sub.df" and then df[42,X] would output the same as df$sub.df[42].
I'm trying to create a single row data.frame to replace a row within the sub-data.frames. (I'm doing this repeatedly and that's why my sub-data.frame name is in a string). However, I'm having trouble inserting this new data into these sub-data.frames. Here is a MWE:
#Set up the data.frames and sub-data.frames
sub.frame <- data.frame(X=1:10,Y=11:20)
df <- data.frame(A=21:30)
df$Z <- sub.frame
Col.Var <- "Z"
#Create a row to insert
new.data.frame <- data.frame(X=40,Y=50)
#This works:
df$Z[3,] <- new.data.frame
#These don't (even though both sides of the assignment give the correct values/dimensions):
df[,Col.Var][6,] <- new.data.frame #Gives Warning and collapses df$Z to 1 dimension
df[7,Col.Var] <- new.data.frame #Gives Warning and only uses first value in both places
#This works, but is a work-around and feels very inelegant(!)
eval(parse(text=paste0("df$",Col.Var,"[8,] <- new.data.frame")))
Are there any better ways to do this kind of insertion? Given my experience with R, I feel like this should be easy, but I can't quite figure it out.
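One possible approach, offered as a sketch rather than as the thread's confirmed answer: [[ ]] accepts the column name as a string in the same way that $ accepts a literal name, so the form that already works with df$Z can be written with the variable instead. The setup below repeats the MWE from the question:

```r
# Set up the data.frames and sub-data.frames (as in the question)
sub.frame <- data.frame(X = 1:10, Y = 11:20)
df <- data.frame(A = 21:30)
df$Z <- sub.frame
Col.Var <- "Z"

# Row to insert
new.data.frame <- data.frame(X = 40, Y = 50)

# Select the sub-data.frame by name with [[ ]], then replace one of its rows
df[[Col.Var]][8, ] <- new.data.frame
df$Z[8, ]
```

This avoids eval(parse(...)) while keeping the sub-data.frame two-dimensional.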

R: manipulate list of dataframes based on condition

I consider this question difficult, it is way over my level, and I would like some help to learn how to do this myself in the future. If I'm not providing enough information, or providing unclear information, please let me know.
I have a list of dataframes:
d1<-data.frame( Data0 = c("N,R,15,P,D", "_KEY_VALUE_1", -1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25),
Data1 = c("N,15,C,D", "Garden",0.9759,0.7121,0.7376,0.7647,0.7927,0.8209,0.8487,0.8759,0.9021,0.9274,0.9518,
1,1.0249,1.0514,1.0805,1.1132,1.1508,1.1946,1.2462,1.3071,1.3793,1.4649,1.5661,1.6854,1.8254,1.9887))
d2<-data.frame(
Data0=c("N,R,2,I,D","no_flowers",-2 , 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 ,10 ,11) ,
Data1=c("N,15,C,D","Garden",0.8891 ,0.8891,0.9051,1,0.8891,0.8891,0.7907,0.8891,0.9929,0.8891,0.8891,0.8891,0.8891)
)
d3<-data.frame(Data0=c("A,X,15,P,D","_KEY_TEXT_1","Y","N","U"),
Data1=c("N,15,C,D","Garden",1.0834,1,1))
d4<-data.frame(
Data0=c("A,X,15,P,D","_KEY_TEXT_1","Y","Y","Y","Y","Y","Y","N","N","N","N","N","N"),
Data1=c("N,R,3,I,D","house_age",16,18,19,20,21,50,16,18,19,20,21,50),
Data2=c("N,15,C,D","Garden",2.2291,2.0743,1.9369,1.8148,1.7064,1.6102,2.2291,2.0743,1.9369,1.8148,1.7064,1.6102)
)
dfl<-list(d1,d2,d3,d4)
names(dfl)<-c("no_animals","no_flowers","radiation","summer_x_house_age")
If you look at the first value of the first column in each dataframe, the second letter (after the first comma) is either R or X. R stands for Ranged and X stands for not Ranged. If the letter is "R" (Ranged), I would like to turn the column into two columns, i.e. I would like the result for the d1 dataframe to look like this:
For the d4 dataframe, an interaction between "summer" (Y/N) and "house age", we see that only the second column (house age) is ranged, so I would like to do the same as for d1, but for both summer=Y and summer=N.
A little bit of background on the data frames, if it makes things easier to understand:
These are the results of a glm model I have made outside of R, and I wish to import it into R. The last column of the dataframe is always the beta values of the regression, and the column(s) before are the variables, which are sometimes categorical (X) and sometimes continuous (R). When they are continuous/ranged, I must manipulate the column to get "from" and "to", because I want to use this list to calculate probabilities for some data where I have values of the regressors I have used in my glm model. The upmost number means "from & not including infinity, to & including upmost number", second upmost number means "from & not including upmost number, to & including second upmost number", and so on.
Think I've got it.
Define a new function which looks for the key letter (R or X) and returns either a new data frame (if R) or the same data frame (if X).
Rcheck <- function(df){
  # Isolate the letter being tested for R or X
  key_letter <- substr(as.character(df[1,1]),3,3)
  if( key_letter == "R"){ # Proceed if letter is R
    # Assign new dataframe
    df_new <- df
    # Add new column.
    df_new[,'Data0_'] <- as.character(df_new[,'Data0'])
    # Shift down and add -9999 value
    rows <- nrow(df_new)
    df_new[,'Data0_'][4:rows] <- as.character(df_new[,'Data0'][3:(rows-1)])
    df_new[,'Data0_'][3] <- "-9999"
    # Take new column from the end and put it beside Data0
    column1_name <- colnames(df_new)[1]
    new_column_name <- colnames(df_new)[ncol(df_new)]
    other_column_names <- colnames(df_new)[2:(ncol(df_new)-1)]
    df_new <- df_new[,c(column1_name, new_column_name, other_column_names)]
    df_new
  } else{ # If letter is not R
    df
  }
}
Then apply this function to your list of data frames using lapply.
new_list <- lapply(dfl, Rcheck)
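The shift that Rcheck performs can be seen in isolation on a toy column. This is only a sketch: toy is a shortened, made-up version of d2, the from/to/beta column names are hypothetical labels, and -9999 is the same sentinel for "minus infinity" used in the function above:

```r
# Toy version of d2: row 1 = type descriptor, row 2 = variable name, rest = data
toy <- data.frame(
  Data0 = c("N,R,2,I,D", "no_flowers", "-2", "0", "1"),
  Data1 = c("N,15,C,D", "Garden", "0.8891", "0.8891", "0.9051"),
  stringsAsFactors = FALSE
)

# Upper bounds ("to", inclusive) are the data rows of the ranged column
upper <- toy$Data0[-(1:2)]

# Lower bounds ("from", exclusive): the previous upper bound, with a
# sentinel standing in for minus infinity in the first range
lower <- c("-9999", head(upper, -1))

ranges <- data.frame(from = lower, to = upper, beta = toy$Data1[-(1:2)])
ranges
```

Each row of ranges then reads as "from (exclusive) to (inclusive)" with its beta value, which is the layout described in the question.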

Storing numeric vectors in the names of a list

Is it possible to store a numeric vector in the names variable of a list?
i.e.
x <- c(1.2,3.4,5.9)
alist <- list()
alist[[x]]$somevar <- 2
I know I can store it as a vector within the list element, but I thought it would be faster to move through and find the element of the list I want (or add if needed) if the numeric vector is the name of the list element itself...
EDIT:
I have included a snippet of the code in context below; apologies for the change in nomenclature. In brief, I am working on a clustering problem. The dataset is too large to do the distance calculation on directly, so my solution was to create bins for each dimension of the data and find the nearest bin for each observation in the original data. Of course, I cannot make a complete permutation matrix since this would be larger than the original data itself. Therefore, I have opted to find the nearest bin for each dimension individually and add it to a vector, temp.bin, which ideally would become the name of the list element in which the rowname of the original observation would be stored. I was hoping that this would simplify searching for and adding bins to the list.
I also realise that the distance calculation part is likely wrong - this is still very much a prototype.
binlist <- list()
for(i in 1:nrow(data)) # iterate through all data points
{
  # for each datapoint make a container for the nearest bin found
  temp.bin <- vector(length = length(markers))
  names(temp.bin) <- markers
  for(j in markers) # and dimensions
  {
    # find the nearest bin for marker j
    if(dist == "eucl")
    {
      dists <- apply(X=bin.mat, MARGIN = 1, FUN= function(x,y) {sqrt(sum((x-y)^2))}, y=data[i,j])
      temp.bin[j] <- bin.mat[which(dists == min(dists)),j]
    }
  }
  ### I realise this part doesn't work
  binlist[[temp.bin]] <- append(binlist[[temp.bin]], values = i)
}
The closest answer so far is John Coleman's.
names(alist) is a character vector. A numeric vector is not a string, hence it isn't a valid name for a list element. What you want is thus impossible. You could create a string representation of such a list and use that as a name, but that would be cumbersome. If this is what you really wanted to do, you could do something like the following:
x <- c(1.2,3.4,5.9)
alist <- list()
alist[[paste(x,collapse = " ")]]$somevar <- 2
This will create a 1-element list whose only element has the name "1.2 3.4 5.9".
While there might be some use cases for this, I suspect that you have an XY problem. What are you trying to achieve?
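As a sketch of how the string-key idea could serve the binning use case from the question: key is a hypothetical helper name, and the row indices 1 and 7 are made up for illustration.

```r
# Hypothetical helper: collapse a numeric vector into one string key
key <- function(v) paste(v, collapse = " ")

binlist <- list()
x <- c(1.2, 3.4, 5.9)

# Indexing a list by a name that does not exist yet returns NULL,
# so c() creates the element on first use and extends it afterwards
binlist[[key(x)]] <- c(binlist[[key(x)]], 1L)
binlist[[key(x)]] <- c(binlist[[key(x)]], 7L)

names(binlist)
binlist[[key(x)]]
```

One caveat with this design is that the key depends on how the numbers are formatted as text, so two vectors that differ only by floating-point noise may end up under different names.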
Solution
With some slight modifications we can achieve the following:
x = c(1.2,3.4,5.9)
alist = vector("list", length(x))
names(alist) = x
alist[as.character(x)] = list(c(somevar = 2))
#$`1.2`
#somevar
# 2
#
#$`3.4`
#somevar
# 2
#
#$`5.9`
#somevar
# 2
Explanation
Basically:
I had to create the list with the correct length (vector("list", length(x)))
Then assign the correct names (names(alist) = x)
So we can refer to the list elements by name using [ and assign a new value (here the named vector c(somevar = 2)) to each list element (alist[as.character(x)] = list(c(somevar = 2)))
2nd Solution
Going by John Coleman's comment:
It isn't clear that you answered the question. You gave a list whose
vector of names is the vector x, coerced to character. OP asked if
it was possible "if the numeric vector is the name of the list element
itself... ". They wanted to treat x as a single name, not a vector of
names.
If you wanted to have the list element named after the variable x itself (i.e. the name "x"), you could try the deparse(substitute(.)) trick:
x = c(1.2,3.4,5.9)
alist = list()
alist[[deparse(substitute(x))]]$somevar = 2
#> alist[[deparse(substitute(x))]]
#$somevar
#[1] 2
If you really wanted the numeric values in x as the name itself, then I point you to John's solution.
