Creating a data frame using a single line code - r

I need to select data for 3 variables and place them in a new data frame using a single line of code. The data frame I'm pulling from is Dance, the 3 variables are Lindy, Blues and Contra.
I have this:
Dance$new<-subset(Dance$Type==Lindy, Dance$Type==Blues, Dance$Type==Contra)
Can you tell what I'm doing wrong?

There are a number of ways you can do this, but I'd forget the subset part
danceNew <- Dance[Dance$Type=="Lindy"|Dance$Type=="Blues"|Dance$Type=="Contra",]
If you only want specific columns
danceNew <- Dance[Dance$Type=="Lindy"|Dance$Type=="Blues"|Dance$Type=="Contra",c("Col1", "Col2")]
Alternatively
danceNew <- Dance[Dance$Type %in% c("Blues", "Contra", "Lindy"),]
Again, if you only want specific columns do the same. The advantage with the final options is you can pass the values in as a variable, thereby making it more dynamic, e.g
danceNames <- c("Lindy", "Blues", "Contra")
danceNew <- Dance[Dance$Type %in% danceNames,]

you're mixing up the variables and the dataframes
this should do the trick..
if your initial dataframe is called "Dance" and the new dataframe is called "Dance.new":
Dance.new <- subset(Dance, Dance$Type=="Lindy" & Dance$Type=="Blues" & Dance$Type=="Contra"); row.names(Dance.new) <- NULL
I like using "row.names(Dance.new) <- NULL" line so I won't have the useless column of "row.names" in the new dataframe

Thanks for your help everyone. This is what ended up working for me.
dancenew<-subset(Dance, Type=="Lindy" | Type== "Blues" | Type=="Contra")

Related

How to change data within a column in a dataset in R

I have created a for loop that goes through each row of a certain column. I want to change the information written in that cell depending on certain conditions, so I implemented an if/else statement.
However, the current problem is that the data is printing out one specific outcome: B.
I tried to combat this problem by exporting using write.csv and importing using read.csv.
When I applied the head() function though, I still got Medium for all rows.
Would anyone be able to help with this please?
Walkthrough the following example step by step. You need to assign for loop variable correctly. Could you show us the data frame where you are changing values? That would be helpful.
#creating new data frame
Df <- data.frame(a=c(1,2,3,4,5),b=c(2,3,5,6,8),c=c(10,4,2,3,7))
for (k in 1:dim(Df)[1]) {
#see how k is utilised and Df$Newcolumn creates new column in existing dataframe
if (Df$a[k]<=3) {
Df$Newcolumn[k] <- "low"
}else if (Df$a[k]>3 && Df$a[k]<=6) {
Df$Newcolumn[k] <- "medium"
}
}
you do not need to use a for loop for creating a new column based upon conditions. You could simply use this:
cool$b<-cool$a
cool$b[cool$a <3]<-"low"
cool$b[cool$a >= 3 & school_data2019$Taxable.Income< 4]<-"Medium"
cool$b[cool$a >= 4 & school_data2019$Taxable.Income < 5]<-"Rich"
cool$b[cool$a >5]<-"Very Rich"

Dynamically change part of variable name in R

I am trying to automatise some post-hoc analysis, but I will try to explain myself with a metaphor that I believe will illustrate what I am trying to do.
Suppose I have a list of strings in two lists, in the first one I have a list of names and in the other a list of adjectives:
list1 <- c("apt", "farm", "basement", "lodge")
list2 <- c("tiny", "noisy")
Let's suppose also I have a data frame with a bunch of data that I have named something like this as they are the results of some previous linear analysis.
> head(df)
qt[apt_tiny,Intercept] qt[apt_noisy,Intercept] qt[farm_tiny,Intercept]
1 4.196321 -0.4477012 -1.0822793
2 3.231220 -0.4237787 -1.1433449
3 2.304687 -0.3149331 -0.9245896
4 2.768691 -0.1537728 -0.9925387
5 3.771648 -0.1109647 -0.9298861
6 3.370368 -0.2579591 -1.0849262
and so on...
Now, what I am trying to do is make some automatic operations where the strings in the previous lists dynamically change as they go in a for loop. I have made a list with all the distinct combinations and called it distinct. Now I am trying to do something like this:
for (i in 1:nrow(distinct)){
var1[[i]] <- list1[[i]]
var2[[i]] <- list2[[i]]
#this being the insertable name part for the rest of the variables and parts of variable,
#i'll put it inside %var[[i]]% for the sake of the explanation.
%var1[[i]]%_%var2[[i]]%_INT <- df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]`+ df$`qt[%var1[[i]]%,Intercept]`
}
The difficult thing for me here is %var1[[i]]% is at the same time inside a variable and as the name of a column inside a data frame.
Any help would be much appreciated.
You cannot use $ to extract column values with a character variable. So df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept] will not work.
Create the name of the column using sprintf and use [[ to extract it. For example to construct "qt[apt_tiny,Intercept]" as column name you can do :
i <- 1
sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])
#[1] "qt[apt_tiny,Intercept]"
Now use [[ to subset that column from df
df[[sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])]]
You can do the same for other columns.

Generate a new column based on data in multiple columns

I have a dataset from a colleague.
In the dataset we record the location where a given skin problem is.
We record up to 20 locations for the skin problem.
i.e
scaloc1 == 2
scaloc2 == 24
scaloc3 == NA
scalocn......
Would mean the skin problem was in place 1 and 24 and nowhere else
I want to reorganise the data so that instead of being like this it is
face 1/0 torso 1/0 etc
So for example if any of scaloc1 to scalocn contain the value 3 then set the value of face to be 1.
I had previously done this in STATA using:
foreach var in scaloc1 scaloc2 scaloc3 scaloc4 scaloc5 scaloc6 scaloc7 scaloc8 scaloc9 scal10 scal11 scal12 scal13 scal14 scal15 scal16 scal17 scal18 scal19 scal20{
replace facescalp=1 if (`var'>=1 & `var'<=6) | (`var'>=21 & `var'<=26)
}
I feel like I should be able to do this using either a dreaded for loop or possibly something from the apply family?
I tried
dataframe$facescalp <-0
#Default to zero
apply(dataframe[,c("scaloc1","scaloc2","scalocn")],2,function(X){
dataframe$facescalp[X>=1 & X<7] <-1
})
#I thought this would look at location columns 1 to n and if the value was between 1 and 7 then assign face-scalp to 1
But didn't work....
I've not really used apply before but did have a good root around examples here and can't find one which accurately describes my current issue.
An example dataset is available:
https://www.dropbox.com/s/0lkx1tfybelc189/example_data.xls?dl=0
If anything not clear or there is a good explanation for this already in a different answer please do let me know.
If I understand your problem correctly, the easiest way to solve it would probably be the following (this uses your example data set that you provided read in and stored as df)
# Add an ID column to identify each patient or skin problem
df$ID <- row.names(df)
# Gather rows other than ID into a long-format data frame
library(tidyr)
dfl <- gather(df, locID, loc, -ID)
# Order by ID
dfl <- dfl[order(dfl$ID), ]
# Keep only the rows where a skin problem location is present
dfl <- dfl[!is.na(dfl$loc), ]
# Set `face` to 1 where `locD` is 'scaloc1' and `loc` is 3
dfl$face <- ifelse(dfl$locID == 'scaloc1' & dfl$loc == 3, 1, 0)
Because you have a lot of conditions that you will need to apply in order to fill the various body part columns, the most efficient rout would probably to create a lookup table and use the match function. There are many examples on SO that describe using match for situations like this.
Very helpful.
I ended up using a variant of this approach
data_loc <- gather(data, "site", "location", c("scaloc1", "scaloc2", "scaloc3", "scaloc4", "scaloc5", "scaloc6", "scaloc7", "scaloc8", "scaloc9", "scal10", "scal11", "scal12", "scal13", "scal14", "scal15", "scal16", "scal17", "scal18", "scal19", "scal20"))
#Make a single long dataframe
data_loc$facescalp <- 0
data_loc$facescalp[data_loc$location >=1 & data_loc$location <=6] <-1
#These two lines were repeated for each of the eventual categories I wanted
locations <- group_by(data_loc,ID) %>% summarise(facescalp = max(facescalp), upperarm = max(upperarm), lowerarm = max(lowerarm), hand = max(hand),buttockgroin = max(buttockgroin), upperleg = max(upperleg), lowerleg = max(lowerleg), feet = max(feet))
#Generate per individual the maximum value for each category, hence if in any of locations 1 to 20 they had a value corresponding to face then this ends up giving a 1
data <- inner_join(data,locations, by = "ID")
#This brings the data back together

R: add column to dataframe, named based on formula

More 'feels like it should be' simple stuff which seems to be eluding me today. Thanks in advance for assistance.
Within a loop, that's within a function, I'm trying to add a column, and name it based on a formula.
I can bind a column & its name is taken from the bound object: data<-cbind(data,bothdata)
I can bind a column & manually name the bound object: data<-cbind(data,newname=bothdata)
I can bind a column which is the product of an equation & manually name the bound object: data<-cbind(data,newname2=bothdata-1)
Or another way: data <- transform(data, newColumn = bothdata-1)
What I can't do is have the name be the product of a formula. My actual formula-derived example name is paste("E_wgt",rev(which(rev(Esteps) == q))-1,"%") & equation for column: baddata - q.
A simpler one: data<-cbind(data,paste("magic",100,"beans")=bothdata-1). This fails because cbind isn't expecting the = even though it's fine in previous examples. Same fail for transform.
My first thought was assign but while I've used this successfully for creating forumla-named objects, I can't see how to get it to work for formula-named columns.
If I use an intermediary step to put the naming formula in an object container then use that, e.g.:
name <- paste("magic",100,"beans")
data<-cbind(data,name=bothdata-1)
the column name is "name" not "magic100beans". If I assign the equation result to an formula-named object:
assign(paste("magic",100,"beans"),bothdata-1)
Then try to cbind that via get:
data<-cbind(data,get(paste("magic",100,"beans")))
The column is called "get(paste("magic",100,"beans"))". Boo! Any thoughts anyone? It occurs to me that I can do cbind then separately colnames(data)[ncol(data)] <- paste("magic",100,"beans")) which I guess I'll settle for for now, but would still be interested to find if there was a direct way.
Thanks.
Chances are that cbind is overkill for your use case. In almost every instance, you can simply mutate the underlying data frame using data$newname2 <- data$bothdata - 1.
In the case where the name of the column is dynamic, you can just refer to it using the [[ operator -- data[["newcol"]] <- data$newname + 1. See ?'[' and ?'[.data.frame' for other tips and usages.
EDIT: Incorporated #Marek's suggestion for [["newcol"]] instead of [, "newcol"]
It may help you to know that data$col1 is the same than data[,"col1"] which is the same than data[,x] if x is "col1". This is how I usually access/set columns programmatically.
So this should work:
name <- paste("magic",100,"beans")
data[,name] <- obsdata-1
Note that you don't have to use the temporary variable name. This is equivalent to:
data$magic100beans <- obsdata-1
Itself equivalent, for a data.frame, to:
data<-cbind(data, magic100beans=bothdata-1)
Just so you know, you could also set the names afterwards:
old_names <- names(data)
name <- paste("magic",100,"beans")
data <- cbind(data, bothdata-1)
data <- setNames(data, c(old_names, name))
# or
names(data) <- c(old_names, name)

data table and data frame operations

I am using some R code that uses a data table class, instead of a data frame class.
How would I do the following operation in R without having to transform map.dt to a map.df?
map.dt = data.table(chr = c("chr1","chr1","chr1","chr2"), ref = c(1,0,3200,3641), pat = c(1,3020,3022, 3642), mat = c(1,0,3021,0))
parent = "mat"
chrom = "chr1"
map.df<-as.data.frame(map.dt);
parent.block.starts<-map.df[map.df$chr == chrom & map.df[,parent] > 0,parent];
Note: parent needs to be dynamically allocated, its an input from the user. In this example I chose "mat" but it could be any of the columns.
Note1: parent.block.starts should be a vector of integers.
Note2: map.dt is a data table where the column names are c("chr","ref","pat","mat").
The problem is that in data tables I cannot access a given column by name, or at least I couldn't figure out how.
Please let me know if you have some suggestions!
Thanks!
It's a little unclear what the end goal is here, especially without sample data, but if you want to access rows by character name there are two ways to do this:
Columns = c("A", "B")
# .. means "look up one level"
dt[,..Columns]
dt[,get("A")]
dt[,list(get("A"), get("B"))]
But if you find yourself needing to use this technique often, you're probably using data.table poorly.
EDIT
Based on your edit, this line will return the same result, without having to do any as.data.frame conversion:
> map.dt[chr==chrom & get(parent) > 0, get(parent)]

Resources