I have a data frame with two columns and need to add a third. The new column should contain A for a specified number of rows at the top and B for a specified number of rows after that.
I tried this: data_exercise_3["newcolumn"] <- (1:6)
but it didn't work. Can someone tell me what I'm doing wrong, please?
Looks like you're having a problem with subsetting a data frame correctly. I'd recommend reviewing this concept before you proceed much further, either via a Coursera course or on a website like this UCLA R learning module on subsetting data frames. Subsetting is a crucial component of data wrangling with R, and you'll go much faster with a solid foundation of the basics!
You can assign values to a subset of a data frame by using [row, column] notation. Since your data frame is called data_exercise_3 and the column you'd like to assign values to is called 'newcolumn', then assuming you want the first 6 rows as 'A' and the next 3 as 'B', you could write it like this:
data_exercise_3[1:6,'newcolumn'] <- 'A'
data_exercise_3[7:9,'newcolumn'] <- 'B'
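Alternatively, build the whole column in one step with rep(). The counts must add up to the number of rows in the data frame (this version assumes 6 rows of each and names the column category):
data_exercise_3$category <- c(rep("A",6),rep("B",6))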
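Here is a self-contained sketch of the same idea on a made-up 9-row stand-in for data_exercise_3 (the columns x and y are assumptions, since the real data isn't shown). In the original attempt, 1:6 was stored as the column's values instead of being used as a row index. rep() with times= builds the labels directly:

```r
# Toy stand-in for data_exercise_3 (the real data is not shown in the question)
data_exercise_3 <- data.frame(x = 1:9, y = 9:1)

# "A" for the first 6 rows, "B" for the remaining 3;
# the counts in times= must add up to nrow(data_exercise_3)
data_exercise_3$newcolumn <- rep(c("A", "B"), times = c(6, 3))

# Quick check of how many rows got each label
table(data_exercise_3$newcolumn)
```

Building the column in one assignment avoids the intermediate state where some rows are still NA.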
I have a data frame called metadata with multiple columns. One of them is Benefit, which records whether each sample is CB (complete responding), ICB (intermediate) or NCB (not responding at all).
So basically the Benefit column takes one of three values:
Benefit <- c("CB" , "ICB" , "NCB")
I want to make a histogram/barplot of how many samples fall into each of those groups. The column is not numeric. I tried the following code:
hist(as.numeric(metadata$Benefit))
I also tried
barplot(metadata$Benefit)
which obviously didn't work either.
The second thing I want to do is find a relationship between the Age column of the same data frame and the Benefit column. For example, do the younger patients get more benefit? Is there any way to do that?
THANKS!
Hi and welcome to the site :)
One nice way to find issues with code is to run only one command at a time.
# let's create some data
metadata <- data.frame(Benefit = c("ICB", "CB", "CB", "NCB"))
Now, the command 'as.numeric' does not work on character data:
as.numeric(metadata$Benefit) # returns NAs (with a warning)
What we want instead is to count how often each unique value occurs in the Benefit column. We do this with 'table':
tabledata <- table(metadata$Benefit)
and then pass that table to the barplot function to create the plot:
barplot(tabledata)
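For the second part of the question (Age against Benefit), one possible starting point is a boxplot of Age per group plus a Kruskal-Wallis test. This is only a sketch: the ages below are invented so the code runs, and the choice of test is an assumption rather than something suggested in the thread.

```r
# Made-up data for illustration only
metadata <- data.frame(
  Benefit = c("CB", "CB", "CB", "ICB", "ICB", "ICB", "NCB", "NCB", "NCB"),
  Age     = c(41, 45, 52, 50, 58, 61, 63, 66, 70)
)

# Make the group order explicit so the plots read CB, ICB, NCB
metadata$Benefit <- factor(metadata$Benefit, levels = c("CB", "ICB", "NCB"))

# Counts per group, then the bar plot
counts <- table(metadata$Benefit)
barplot(counts, ylab = "Number of samples")

# Age distribution per group, and a rank-based test for a group difference
boxplot(Age ~ Benefit, data = metadata)
kw <- kruskal.test(Age ~ Benefit, data = metadata)
kw$p.value
```

Converting Benefit to a factor with explicit levels keeps the bars and boxes in a meaningful order instead of alphabetical order.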
I'm in a very basic class that introduces R for genetic purposes. I'm encountering a rather peculiar problem in trying to follow the instructions given. Here is what I have along with the instructor's notes:
MangrovesRaw<-read.csv("C:/Users/esteb/Documents/PopGen/MangrovesSites.csv")
#i'm going to make a new dataframe now, with one column more than the mangrovesraw dataframe but the same number of rows.
View(MangrovesRaw)
Mangroves<-data.frame(matrix(nrow = 528, ncol = 23))
#next I want you to name the first column of Mangroves "pop"
colnames(Mangroves)<-c(col1="pop")
#i'm now assigning all values of that column to be 1
Mangroves$pop<-1
#assign the rest of the columns (2 to 23) to the entirety of the MangrovesRaw dataframe
#then change the names to match the mangroves raw names
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
I'm not really sure how to assign columns that haven't been named using the $ as we have in the past. A friend suggested I first run
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw
#X338 is the name of the first column from MangrovesRaw
But while this does transfer the data from MangrovesRaw, it comes at the cost of having my column names messed up with X338. added to every subsequent column. In an attempt to modify this I found the following "fix"
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw[,2]
#Mangroves$X338<-MangrovesRaw[,2:22]
#MangrovesRaw has 22 columns in total
While this transferred all the data I needed for the X338 column, it didn't transfer any data for the remaining 21 columns. The commented-out line just results in the same problem of having X338. show up in all my column names.
What am I doing wrong?
There are a few ways to solve this problem. It may be that your instructor wants it done a certain way, but here's one simple solution: just cbind() the Mangroves$pop column with the real data. Then the data and column names are already added.
# name the argument so that the new column is called "pop"
Mangroves <- cbind(pop = Mangroves$pop, MangrovesRaw)
Here's another way:
Mangroves[, 2:23] <- MangrovesRaw
colnames(Mangroves)[2:23] <- colnames(MangrovesRaw)
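The cbind() approach can be tried on a small stand-in before running it on the real file. The sketch below uses a made-up two-column MangrovesRaw (the column X340 and all values are assumptions) and skips the empty placeholder data frame entirely:

```r
# Toy stand-in for MangrovesRaw (the real file has 528 rows and 22 columns)
MangrovesRaw <- data.frame(X338 = c(1, 2, 3), X340 = c(4, 5, 6))

# Put a constant "pop" column in front; the single value 1 is recycled
# to every row, and the original column names are carried over unchanged
Mangroves <- cbind(pop = 1, MangrovesRaw)

colnames(Mangroves)
```

Because the data and names come straight from MangrovesRaw, there is no separate renaming step to get wrong.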
Again, I apologize if this is a trivial question. I'm still new to R but I am determined to learn. I have one original .csv file (let's call it Original.Data) that I am working with. I am basically reorganizing that spreadsheet: I take each row, transform it into a column, and then stack the results underneath each other using bind_rows.
I have written the code necessary to organize the first row how I wanted. Now I'm trying to create a loop to do this for all rows in Original.Data. Here is the code I created:
Rows 1:3 are basically headers and are eventually deleted, but not yet because they serve an initial purpose. Only row 5 is necessary for the final spreadsheet.
New.Data <-data.frame(t(Original.Data[c(1:3,5),]))
colnames(New.Data) <-c("patid","Day","Week","SD")
New.Data[,1] <- New.Data[1,4]
New.Data$Day <- as.numeric(as.character(New.Data$Day))
New.Data$SD <- as.numeric(as.character(New.Data$SD))
New.Data$Week <- as.numeric(as.character(New.Data$Week))
New.Data <- New.Data[ which(New.Data$Day<=New.Data[3,4]),]
The next line appends the result to a base spreadsheet that all of the data is transferred to. "Base" was already created beforehand.
Final.Base <-bind_rows(Base,New.Data)
My idea is basically to name New.Data as New.Data1, for example, have the loop create one New.Data for each row, and bind them all into Final.Base. So row 1 (or row 5, actually) is turned into New.Data1, row 2 into New.Data2, etc.
Also, in my first line of code:
New.Data <-data.frame(t(Original.Data[c(1:3,5),]))
The only thing I believe that would need to change is the last number "5". It would need to increase by one in every iteration of the loop.
Thank you for all of your help.
I'm not totally sure what you want to do. A little bit of data would be good. But here is something that sounds like what you are asking. It takes a column and tacks it on to the next column and so forth, to give you a vector of the stacked values:
mtcars  # built-in example data set
out <- c()
for (i in 1:ncol(mtcars)) {
  intermediate <- mtcars[, i]
  out <- c(out, intermediate)
}
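For the loop described in the question, a common pattern is to wrap the per-row steps in a function, collect one data frame per row in a list, and bind the list once at the end. The sketch below assumes a made-up Original.Data in which rows 1-3 are the header rows and rows 5 onward are the rows to reshape. reshape_row is a hypothetical helper name, and base R's do.call(rbind, ...) stands in for bind_rows so that no extra package is needed:

```r
# Toy stand-in for Original.Data (all values are made up)
Original.Data <- data.frame(
  V1 = c(1, 1, 1, 0, 10, 20),
  V2 = c(1, 2, 1, 0, 11, 21),
  V3 = c(1, 3, 1, 0, 12, 22)
)

# Reshape one row: transpose the three header rows plus row i into columns.
# Any further cleaning steps from the question would go inside this function.
reshape_row <- function(i) {
  piece <- data.frame(t(Original.Data[c(1:3, i), ]))
  colnames(piece) <- c("patid", "Day", "Week", "SD")
  rownames(piece) <- NULL
  piece
}

# One piece per row (5, 6, ...), collected in a list, then stacked in one step
pieces <- lapply(5:nrow(Original.Data), reshape_row)
Final.Base <- do.call(rbind, pieces)
Final.Base
```

Keeping the pieces in a list removes the need for numbered objects such as New.Data1 and New.Data2. If dplyr is loaded, bind_rows(pieces) should work in place of the do.call() line.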
I have this lookup DATA FRAME:
VAR1=c('X1')
VAR2=c('X2')
VAR3=c('X3')
VAR4=c('X4')
VAR5=c('NA')
df<-data.frame(VAR1,VAR2,VAR3,VAR4,VAR5)
which I need to cross reference with a main DATA FRAME so that I select variables X1 to X5. Sometimes, like the example, column 5 is simply NA.
I would typically use something like the below:
main_data <-subset(main_data, select=c(df[1,1],df[1,2],df[1,3]))
main_data <-subset(main_data, select=c(df[1,1:max(col(df))]))
but there are NAs, and moreover I will have a dynamic count of columns and these don't work.
The other idea is to use grepl on main_data but I cannot get it to work with more than one variable at a time:
main_data <- main_data[, grepl(paste0(df[1:max(col(df))], colnames(main_data)))]
I am certain there is a straightforward way to do this but I cannot find it.
With Roman's help I got it:
df<-as.vector(unlist(df))
main_data<-main_data[, names(main_data) %in% df]
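A self-contained version of this lookup, as a sketch: the contents of main_data are made up, the lookup table is renamed to lookup (an assumption, to avoid overwriting df), and both a real NA and the string "NA" are filtered out before selecting.

```r
# Lookup table as in the question; the last entry is missing
lookup <- data.frame(VAR1 = "X1", VAR2 = "X2", VAR3 = "X3", VAR4 = "X4",
                     VAR5 = NA, stringsAsFactors = FALSE)

# Toy main data with one column (X9) that is not in the lookup
main_data <- data.frame(X1 = 1:2, X2 = 3:4, X3 = 5:6, X4 = 7:8, X9 = 9:10)

# Flatten the lookup row to a character vector and drop missing entries
wanted <- as.character(unlist(lookup[1, ], use.names = FALSE))
wanted <- wanted[!is.na(wanted) & wanted != "NA"]

# Keep only the columns that exist in main_data, in lookup order
selected <- main_data[, intersect(wanted, names(main_data)), drop = FALSE]
names(selected)
```

intersect() keeps the order given in the lookup table, and drop = FALSE makes sure the result stays a data frame even if only one column matches.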
I am very new to R and I am struggling to understand how to omit NA values in a specific way.
I have a large dataframe with several columns (up to 40) and rows (up to 200ish). I want to use data from one of the columns to do simple stats (wilcox.test, boxplot, etc): one column will have a continuous variable (V1), while the other has a binary variable (V2; 0 or 1), which divides 2 groups. I want to do this for the continuous variable using different V2 binary variables, which are unrelated. I organized this data in Excel, saved it as CSV and am using R Studio.
All these columns have interspersed NA values, and when I use na.omit, it removes every single row where an NA value is present, which takes away an awful lot of data. Is there any simple solution for this? I have seen some answers to similar topics, but none seems to be quite what I need.
Many thanks for any answer. Again, I am a baby-level newbie to R and may have overlooked something in other topics!
If I understand correctly, you want to apply a function to one pair of columns at a time.
wilcox.test(V1,V2)
wilcox.test(V1,V3)...
Where Vi have no missing values. I would do something like this :
## use complete.cases to keep only the rows where both columns
## of the selected pair are present
apply_clean <-
  function(x, y) {
    ok <- complete.cases(x, y)
    wilcox.test(x[ok], y[ok])
  }
## apply this function to all columns after removing the continuous column
lapply(subset(dat, select = -V1), apply_clean, y = dat$V1)
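Since V2 and the other 0/1 columns define groups rather than a second sample, another way to read the question is to compare V1 between the two groups of each binary column. The sketch below makes that assumption, uses made-up data, and relies on the formula interface of wilcox.test. Missing values are dropped one pair of columns at a time, so a row is lost only for the test where it actually has an NA.

```r
# Made-up data: V1 is continuous, V2 and V3 are unrelated 0/1 groupings
dat <- data.frame(
  V1 = c(5.1, 6.2, NA, 7.3, 4.8, 6.9, 5.5, 7.1),
  V2 = c(0, 1, 0, 1, NA, 1, 0, 1),
  V3 = c(1, 1, 0, NA, 0, 0, 1, 0)
)

group_cols <- setdiff(names(dat), "V1")

results <- lapply(group_cols, function(g) {
  # Keep rows where both V1 and this grouping column are present
  ok  <- complete.cases(dat$V1, dat[[g]])
  sub <- dat[ok, c("V1", g)]
  # Builds the formula V1 ~ <g> and compares V1 between the two groups
  wilcox.test(reformulate(g, response = "V1"), data = sub)
})
names(results) <- group_cols

# p-value per grouping column
sapply(results, function(r) r$p.value)
```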
You can manipulate the data.frame to omit based on any rules you like. For example:
dirty.frame <- data.frame(col1 = c(1,2,3,4,5,6,7,NA,9,10), col2 = c(10, 9, 8, 7,6,5,4,3,2,1))
cleaned.frame <- dirty.frame[!is.na(dirty.frame$col1),]
This code uses is.na() to test whether each row has an NA in a specific column. The ! negates the test, so rows where col1 is NA are dropped and all other rows are kept.
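To see how much data this saves compared with na.omit(), here is a small sketch on a made-up three-column frame. It filters on just the two columns needed for one analysis by applying complete.cases() to that subset:

```r
# Made-up frame in which every column has an NA in a different row
dirty.frame <- data.frame(
  col1 = c(1, NA, 3, 4),
  col2 = c(NA, 2, 3, 4),
  col3 = c(1, 2, NA, 4)
)

# na.omit() drops any row with an NA anywhere
all_clean <- na.omit(dirty.frame)

# complete.cases() on two columns drops only rows missing col1 or col3
pair_clean <- dirty.frame[complete.cases(dirty.frame[, c("col1", "col3")]), ]

nrow(all_clean)
nrow(pair_clean)
```

Here na.omit() leaves a single row, while the column-specific filter keeps two, because the NA in col2 no longer matters.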