Creating empty R dataframe and adding data row-by-row - r

I'm new to R and this hurdle may be a case of me crossing my R and Python wires - I apologise if that's the case.
I have some data that is supplied as individual rows. I'd like to create an empty dataframe and add each row of data one at a time. I read several posts that recommend not doing this if possible but, in this case, I think it should be easier. I've read several posts giving solutions to the same problem and I think I've followed them. The code I have so far is:
# Create empty dataframe with 1 column for string and several integer columns:
df = data.frame(name=character(), int_a=integer(), int_b=integer(), int_c=integer(), int_d=integer(), int_e=integer(), stringsAsFactors=FALSE)
# Create a series of lists containing the data
r1 = list(name="Row1", int_a=13234, int_b=567, int_c=566, int_d=53, int_e=11)
r2 = list(name="Row2", int_a=34454, int_b=34, int_c=643, int_d=33, int_e=56)
r3 = list(name="Row3", int_a=73857, int_b=3, int_c=226, int_d=4, int_e=55)
r4 = list(name="Row4", int_a=86754, int_b=346, int_c=384, int_d=35, int_e=59)
r5 = list(name="Row5", int_a=33748, int_b=456, int_c=461, int_d=6, int_e=85)
r6 = list(name="Row6", int_a=97865, int_b=34654, int_c=65, int_d=35, int_e=148)
r7 = list(name="Row7", int_a=36475, int_b=3444, int_c=365, int_d=55, int_e=34)
r8 = list(name="Row8", int_a=84748, int_b=454, int_c=345, int_d=148, int_e=884)
r9 = list(name="Row9", int_a=94848, int_b=23454, int_c=6548, int_d=7, int_e=566)
# Add row by row:
df = rbind(df, r1)
df = rbind(df, r2)
df = rbind(df, r3)
df = rbind(df, r4)
df = rbind(df, r5)
df = rbind(df, r6)
df = rbind(df, r7)
df = rbind(df, r8)
df = rbind(df, r9)
The end result is almost right but there are some errors – it looks like this:
name int_a int_b int_c int_d int_e
2 Row1 13234 567 566 53 11
21 <NA> 34454 34 643 33 56
3 <NA> 73857 3 226 4 55
4 <NA> 86754 346 384 35 59
5 <NA> 33748 456 461 6 85
6 <NA> 97865 34654 65 35 148
7 <NA> 36475 3444 365 55 34
8 <NA> 84748 454 345 148 884
9 <NA> 94848 23454 6548 7 566
A series of warnings of the following form is also generated:
1: In `[<-.factor`(`*tmp*`, ri, value = "Row2") :
invalid factor level, NA generated
Can anyone explain why the strings are not being entered into the dataframe and why the row names are a bit odd?
Thanks in advance.

options(stringsAsFactors = FALSE)
your code ....
options(stringsAsFactors = TRUE)
This will work. I'm not sure why specifying stringsAsFactors=FALSE in data.frame(), as the OP did, is not enough; I would appreciate clarification on this as well.
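A possible explanation, offered as my reading of the rbind documentation rather than something confirmed in this thread: the data frame method of rbind drops zero-row arguments, so the empty df (and its stringsAsFactors=FALSE setting) is ignored on the first call, and r1 is converted using the global default, which was TRUE before R 4.0.0. The name column then becomes a factor with the single level "Row1", and every later name is an invalid level. A minimal sketch that avoids the empty data frame altogether, assuming the rows can be collected in a list first (only three rows and two integer columns are shown to keep it short):

```r
# Collect the incoming rows in a plain list first.
rows <- list(
  list(name = "Row1", int_a = 13234, int_b = 567),
  list(name = "Row2", int_a = 34454, int_b = 34),
  list(name = "Row3", int_a = 73857, int_b = 3)
)

# Convert each row to a one-row data frame with character strings,
# then bind them all in a single call.
df <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
df
```

Binding once at the end also avoids copying the growing data frame on every rbind call.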

Related

Add new column based on coincidence of two columns

I have these two starter data frames:
df1 <- data.frame("Location" = c('NE', 'SW', 'NW'), "Time" = c('0400', '1620', '2110'), "Assignment" = c('Painter', 'Astronaut', 'Bartender'), "Frequency" = c(84, 122, 139))
df1
Location Time Assignment Frequency
1 NE 0400 Painter 84
2 SW 1620 Astronaut 122
3 NW 2110 Bartender 139
df2 <- data.frame("Location" = c('NE', 'SW', 'NW', 'NW', 'SE'), "Time" = c('0400', '1620', '2110', '2240', '1410'), "Assignment" = c('Scripter', 'Port Patrol', 'Lawyer', 'Supplier', 'Youtuber'), "Frequency" = c(82, 126, 144, 94, 102))
df2
Location Time Assignment Frequency
1 NE 0400 Scripter 82
2 SW 1620 Port Patrol 126
3 NW 2110 Lawyer 144
4 NW 2240 Supplier 94
5 SE 1410 Youtuber 102
Suppose I didn't know which data frame was larger (in this case df2 has more rows than df1). I want to find the rows where the values of both 'Location' and 'Time' coincide. For these matches, a new column should say 'Coincide'; otherwise the column should be NA.
For this, I tried:
df3$NewCol <- NA
df3$NewCol[df1$Location == df2$Location & df1$Time == df2$Time] <- 'Coincide'
or
if(df1$Location == df3$Location & df1$Time == df3$Time) {
df3$NewCol <- 'Coincide'
}
(In these attempts I used a new df3, which is a merge of df1 and df2.)
But with both attempts I get the error:
longer object length is not a multiple of shorter object length
I believe this is caused by the two data frames having different numbers of rows, but how can I overcome this?
Thanks in advance
Answering the first question of adding a new column with 'Coincide'.
We can do a full join with df1 and df2 which would give all the entries present in both the dataframes irrespective of their size. We can then check for NA values and assign 'Coincide' or NA value based on that.
all_data <- merge(df1, df2, by = c('Location', 'Time'), all = TRUE)
all_data$new_col <- c('Coincide', NA)[(rowSums(is.na(all_data[-c(1:2)])) > 0) + 1]
all_data
# Location Time Assignment.x Frequency.x Assignment.y Frequency.y new_col
#1 NE 0400 Painter 84 Scripter 82 Coincide
#2 NW 2110 Bartender 139 Lawyer 144 Coincide
#3 NW 2240 <NA> NA Supplier 94 <NA>
#4 SE 1410 <NA> NA Youtuber 102 <NA>
#5 SW 1620 Astronaut 122 Port Patrol 126 Coincide
You can then select only the columns that you need for further analysis.
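If a merged table is not needed, the flag can also be added directly to one of the data frames. The following is a sketch rather than part of the answer above: it builds a combined key from 'Location' and 'Time' and assumes the separator "|" never occurs in either column. The names key1 and key2 are made up for the illustration.

```r
df1 <- data.frame(Location = c("NE", "SW", "NW"),
                  Time = c("0400", "1620", "2110"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Location = c("NE", "SW", "NW", "NW", "SE"),
                  Time = c("0400", "1620", "2110", "2240", "1410"),
                  stringsAsFactors = FALSE)

# One key per row; comparing keys with %in% works for any row counts,
# so the "longer object length" problem of == does not arise.
key1 <- paste(df1$Location, df1$Time, sep = "|")
key2 <- paste(df2$Location, df2$Time, sep = "|")

df2$NewCol <- ifelse(key2 %in% key1, "Coincide", NA_character_)
df2
```

Because %in% checks membership instead of comparing element by element, it does not matter which data frame is larger.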

How to remove just the set of numbers with / in between among other strings? [duplicate]

This question already has an answer here:
How can I extract numbers separated by a forward slash in R? [closed]
(1 answer)
Closed 3 years ago.
I need to extract the blood pressure values from a text note that is typically reported as one larger number, "/" over a smaller number, with the units mm HG (it's not a fraction, and only written as such). In the 4 examples below, I want to extract 114/46, 135/67, 109/50 and 188/98 only, without space before or after and place the top number in column called SBP, and the bottom number into a column called DBP.
Thank you in advance for your assistance.
bb <- c("PATIENT/TEST INFORMATION (m2): 1.61 m2\n BP (mm Hg): 114/46 HR 60 (bpm)",
"PATIENT/TEST INFORMATION: 63\n Weight (lb): 100\nBSA (m2): 1.44 m2\nBP (mm Hg): 135/67 HR 75 (bpm)",
"PATIENT/TEST INFORMATION:\nIndication: Coronary artery disease. Hypertension. Myocardial infarction.\nWeight (lb): 146\nBP (mm Hg): 109/50 HR (bpm)",
"PATIENT/TEST INFORMATION:\nIndication: Aortic stenosis. Congestive heart failure. Shortness of breath.\nHeight: (in) 64\nWeight (lb): 165\nBSA (m2): 1.80 m2\nBP (mm Hg): 188/98 HR 140 (bpm) ")
BP <- head(bb,4)
dput(bb)
Base R solution:
setNames(data.frame(do.call("rbind", strsplit(trimws(gsub("[[:alpha:]]|[[:punct:]][^0-9]+", "",
gsub("HR.*", "", paste0("BP", lapply(strsplit(bb, "BP"), '[', 2)))), "both"), "/"))),
c("SBP", "DBP"))
We can use regmatches/regexpr from base R to extract the required values, and then with read.table, create a two column data.frame
read.table(text = regmatches(bb, regexpr('\\d+/\\d+', bb)),
sep="/", header = FALSE, stringsAsFactors = FALSE)
# V1 V2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
Or using strcapture from base R
strcapture( "(\\d+)\\/(\\d+)", bb, data.frame(X1 = integer(), X2 = integer()))
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
To add these as new columns in the original data.frame, either use cbind to bind the output to the original dataset
cbind(data, read.table(text = ...))
Or
data[c("V1", "V2")] <- read.table(text = ...)
Or using extract from tidyr
library(dplyr)
library(tidyr)
tibble(bb) %>%
extract(bb, into = c("X1", "X2"), ".*\\b(\\d+)/(\\d+).*", convert = TRUE)
# A tibble: 4 x 2
# X1 X2
# <int> <int>
#1 114 46
#2 135 67
#3 109 50
#4 188 98
If we don't want to remove the original column, use remove = FALSE in extract
You could use str_match and select the numbers that have a / in between
as.data.frame(stringr::str_match(bb, "(\\d+)/(\\d+)")[, 2:3])
# X1 X2
#1 114 46
#2 135 67
#3 109 50
#4 188 98
In base R, we can extract the numbers that follow the pattern a/b, split them on '/' and form two columns.
as.data.frame(do.call(rbind, strsplit(sub(".*?(\\d+/\\d+).*", "\\1", bb), "/")))
You can give them the column names as per your choice using setNames or any other method.
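For reference, here is a small sketch of the strcapture approach with the column names the question asks for (SBP and DBP). The notes vector is a shortened stand-in for bb, and the third element is an invented note without a reading, included to show that non-matching notes come back as NA rows instead of being dropped.

```r
# Shortened stand-in for bb; the last note has no blood pressure value.
notes <- c("BP (mm Hg): 114/46 HR 60 (bpm)",
           "BP (mm Hg): 135/67 HR 75 (bpm)",
           "no reading recorded")

# Two capture groups, one per prototype column; values are converted
# to integer because the prototype columns are integer().
bp <- strcapture("([0-9]+)/([0-9]+)", notes,
                 data.frame(SBP = integer(), DBP = integer()))
bp
```

Since the result has one row per input note, it can be bound to the original data with cbind without the rows getting out of step.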

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). The problem I have is as follows. My data looks like this (it is held in a CSV file):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
2. Alternatively, the BBmisc package allowed me to do the following after converting my record numbers to factors:
> myData2 <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData2, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completeness, I include TUSHAr's solution as well, showing as always that there are many ways to solve a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
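The question asked for a function that takes the names of the columns to transform and returns the full dataset. A minimal base R sketch of that idea follows; normalize_cols is a name made up for this example, and it assumes the chosen columns are numeric and not constant (a constant column would divide by zero).

```r
# Rescale only the named columns to [0, 1]; all other columns are returned untouched.
normalize_cols <- function(data, cols) {
  data[cols] <- lapply(data[cols], function(x) {
    rng <- range(x, na.rm = TRUE)
    (x - rng[1]) / (rng[2] - rng[1])
  })
  data
}

# First three rows of the example data from the question.
myData <- data.frame(RecNo = c(41, 43, 47),
                     Var1 = c(800, 140, 60),
                     Var2 = c(201.8, 39, 20.24),
                     Var3 = c("Y", "N", "N"),
                     stringsAsFactors = FALSE)

myResult <- normalize_cols(myData, c("Var1", "Var2"))
myResult
```

Selecting by name means RecNo never has to be converted to a factor just to keep it out of the calculation.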

R: Convert consensus output into a data frame

I'm currently performing a multiple sequence alignment using the 'msa' package from Bioconductor. I'm using this to calculate the consensus sequence (msaConsensusSequence) and conservation score (msaConservationScore). This gives me outputs that are values ...
e.g.
ConsensusSequence:
i.llE etc (str = chr)
(lower case = 20%+ conservation, uppercase = 80%+ conservation, . = <20% conservation)
ConservationScore:
221 -296 579 71 423 etc (str = named num)
I would like to convert these into a table where the first row contains columns where each is a different letter in the consensus sequence and the second row is the corresponding conservation score.
e.g.
i . l l E
221 -296 579 71 423
Could people please advise on the best way to go about this?
Thanks
Natalie
From what you have said in the comments, you can get a data frame like this:
data(BLOSUM62)
alignment <- msa(mySequences)
conservation <- msaConservationScore(alignment, BLOSUM62)
# Now create the data frame
df <- data.frame(consensus = names(conservation), conservation = conservation)
head(df)
consensus conservation
1 T 141
2 E 160
3 E 165
4 E 325
5 ? 179
6 ? 71
7 T 216
8 W 891
9 ? 38
10 T 405
11 L 204
If you prefer to transpose it you can:
df <- t(df)
colnames(df) <- 1:ncol(df)
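If the consensus sequence is only available as a single string, as in the question, it can be split into one character per position first. This sketch does not use the msa package; the consensus string and scores are typed in by hand from the example in the question.

```r
# Example values taken from the question, entered manually.
consensus <- "i.llE"
scores <- c(221, -296, 579, 71, 423)

# Split the string into single characters, one per alignment position.
residues <- strsplit(consensus, "")[[1]]

tab <- data.frame(consensus = residues, conservation = scores,
                  stringsAsFactors = FALSE)

# Transposing gives the two-row layout, but as a character matrix.
wide <- t(tab)
wide
```

Note that t() turns the scores into text, so the long format (tab) is the safer one to keep for any further calculations.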

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from the same data frame by comparing the first 5 letters of the column names with each other; if they are equal, the matching columns should be subset and stored in a new variable.
Here is a small example of my required output.
Let's say the data frame is eatable
fruits_area fruits_production vegetables_area vegetable_production
12 100 26 324
33 250 40 580
660 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function checks the string correctly but I am confused how to do matching among the column names in the dataset.
Edit:- I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 0, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select_()
library(tidyverse)
map(v, ~select_(eatable, ~matches(.)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 0, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want but there was a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you're subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
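Another way to group columns by a shared prefix, shown here as a sketch and not taken from the answers above, is split.default, which splits a data frame by columns when given one grouping value per column. The 5-character prefix length is the one from the question.

```r
eatable <- data.frame(c(12, 33, 660), c(100, 250, 510),
                      c(26, 40, 43), c(324, 580, 581))
names(eatable) <- c("fruits_area", "fruits_production",
                    "vegetables_area", "vegetable_production")

# One prefix per column; columns with the same prefix end up together.
prefix <- substr(names(eatable), 1, 5)
groups <- split.default(eatable, prefix)

names(groups)
groups$fruit
```

The result is a named list of data frames, so each group can be picked out by its prefix instead of by position.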

Resources