Data frame subsetting loop - r

I have a dataframe composed of several paired columns. For example, the first column is a list of names and the second column contains numeric values quantifying the variables named in the first column. The third column is again a list of names, the fourth column is numeric and quantifies the variables of the third column, and so on.
I now want to automatically subset the first two columns into a separate dataframe and the third and fourth columns into a second dataframe. The final aim is to align the rows by name.
For example, from dataframe a:
names_a<-c("a","b","c","d")
values_a<-c(1,2,3,4)
names_b<-c("a","b","e","f")
values_b<-c(5,6,7,8)
a<-as.data.frame(cbind(names_a,values_a,names_b,values_b))
I would obtain a dataframe containing names_a and values_a and another dataframe containing names_b and values_b, then aligning them to have dataframe a1:
names_a1<-c("a","b","c","d","e","f")
values_a1<-c(1,2,3,4,0,0)
values_b1<-c(5,6,0,0,7,8)
a1<-as.data.frame(cbind(names_a1,values_a1,values_b1))
Any suggestions?
Thanks in advance for any help

I can help with the first part of your request. Here is how to create the separate data frames.
names_a<-c("a","b","c","d")
values_a<-c(1,2,3,4)
names_b<-c("a","b","e","f")
values_b<-c(5,6,7,8)
a<-as.data.frame(cbind(names_a,values_a,names_b,values_b))
# Subsetting is most often applied to observations (rows); to split by variables (columns) you can create 2 new data frames out of the existing one.
# a34 contains variables 3 and 4
a34 <- data.frame(cbind(as.vector(a$names_b),as.vector(a$values_b)))
colnames(a34) <-c("names_b","values_b")
# then "subset" a (in fact you create a new data frame and overwrite a with it)
a <- data.frame(cbind(as.vector(a$names_a),as.vector(a$values_a)))
colnames(a) <-c("names_a","values_a")
This results in:
> a
  names_a values_a
1       a        1
2       b        2
3       c        3
4       d        4
> a34
  names_b values_b
1       a        5
2       b        6
3       e        7
4       f        8
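For the second part of the request (aligning the rows by name), here is a minimal sketch. It assumes the columns always come as name/value pairs in that order, and it builds the example with data.frame() instead of cbind() so the values stay numeric. The names starts, pairs and aligned are only illustrative.

```r
# Example data from the question, built with data.frame() so values stay numeric
a <- data.frame(
  names_a  = c("a", "b", "c", "d"),
  values_a = c(1, 2, 3, 4),
  names_b  = c("a", "b", "e", "f"),
  values_b = c(5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# Column positions where each name/value pair starts: 1, 3, 5, ...
starts <- seq(1, ncol(a), by = 2)

# One two-column data frame per pair; the name column is renamed to "name"
# so that every piece shares the same merge key
pairs <- lapply(starts, function(i) {
  piece <- a[, c(i, i + 1)]
  names(piece)[1] <- "name"
  piece
})

# Full outer join of all pieces on "name"
aligned <- Reduce(function(x, y) merge(x, y, by = "name", all = TRUE), pairs)

# Names missing from a pair come back as NA; the question wants 0 instead
aligned[is.na(aligned)] <- 0
aligned
```

For the example data this should give the a1 layout from the question: one row per name, with 0 where a name does not occur in a pair. The list pairs also holds the separate data frames, so no manual a34-style copies are needed when there are more than two pairs.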

Related

How to replace values in the columns of a dataframe based on the values in the other column in R?

I have a dataframe containing the safety data for 100 patients. There are different safety factors for each patient with the size of that specific factor.
   v1_d0_urt_redness v1_d0_urt_redness_size v1_d1_urt_redness v1_d1_urt_redness_size ...
P1                 1                     20
P2                 1                     NA
P3                 0                     NA
.
.
.
Here redness=1 means there was redness and redness=0 means there was no redness, and therefore the redness_size was not reported.
To find what proportion of the data is missing, I need to recode the data as follows:
if redness = 1 and redness_size = NA, then redness_size stays NA (truly missing);
else if redness = 0, then redness_size <- 0.
This should be done for d0, d1, ... and repeated for the other variables such as hardness, swelling, etc. Any ideas how one could implement this in R?
If I understand correctly what you are trying to do, and assuming your dataframe is called df and has a single redness/redness_size pair, you can change the values of the redness_size column as shown here. The case redness == 1 with a missing size needs no code, because the size is already NA; only the rows with redness == 0 have to be set to 0. The which() call skips rows where redness itself is NA:
df[which(df[,endsWith(colnames(df),"_redness")] == 0), endsWith(colnames(df),"redness_size")] <- 0
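To cover d0, d1, ... and the other variables (hardness, swelling, etc.) in one pass, a loop over the column pairs may be easier. This is only a sketch: the small df below is made up for illustration, and it assumes every size column is named like its flag column plus the suffix "_size".

```r
# Hypothetical miniature of the safety data: two days of redness flags and sizes
df <- data.frame(
  v1_d0_urt_redness      = c(1, 1, 0),
  v1_d0_urt_redness_size = c(20, NA, NA),
  v1_d1_urt_redness      = c(0, 1, NA),
  v1_d1_urt_redness_size = c(NA, 5, NA),
  row.names = c("P1", "P2", "P3")
)

# Every column ending in "_size" is assumed to have a matching flag column
# with the same name minus that suffix
size_cols <- grep("_size$", names(df), value = TRUE)
flag_cols <- sub("_size$", "", size_cols)
stopifnot(all(flag_cols %in% names(df)))

for (k in seq_along(size_cols)) {
  # which() drops NA flags, so rows with an unknown flag are left untouched
  no_event <- which(df[[flag_cols[k]]] == 0)
  df[no_event, size_cols[k]] <- 0
}

# Proportion of sizes that are still missing, per size column
colMeans(is.na(df[size_cols]))
```

After the loop, an NA in a size column means either that the event occurred without a recorded size or that the flag itself was missing, so the proportions count only values that are really unknown.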

Using list of row numbers as criteria to populate field

I have a list of row numbers that represent the rows containing outliers in a data set. I would like to add an "outlier" column to the original data set that flags the rows containing outliers, but I can't figure out how to use row numbers as criteria in R.
Example:
I have a dataframe like this:
id <-c("a","b","c","d")
values <-c(10,11,22,33)
df<-data.frame(id,values)
  id values
1  a     10
2  b     11
3  c     22
4  d     33
And a list like this containing row number (more correctly "row names"):
outliers <-c(2,4)
I'd like to find a way to use the list of row numbers as criteria in something like:
df$outlier_test<-ifelse( if row number is on my list, "outlier","")
to produce something like this:
  id values outlier_test
1  a     10
2  b     11      outlier
3  c     22
4  d     33      outlier
Spent quite a while trying to puzzle this out and had inspiration as soon as I posted the question. For anyone else who comes here with this question:
First:
df$rownumber<- row.names(df)
then:
df$outlier_test<- ifelse(df$rownumber %in% outliers,"outlier","")
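If the helper column is not wanted, the same flag can be built directly from the row positions. A small sketch, assuming outliers holds row positions (1, 2, ...) and not arbitrary row names:

```r
df <- data.frame(id = c("a", "b", "c", "d"), values = c(10, 11, 22, 33))
outliers <- c(2, 4)

# Compare row positions (1, 2, ..., nrow) with the outlier positions
df$outlier_test <- ifelse(seq_len(nrow(df)) %in% outliers, "outlier", "")
df
```

If the data frame has been filtered or reordered so that row names no longer equal positions, and outliers refers to row names, then rownames(df) %in% outliers is the safer test.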

R data.frame: rowSums of selected columns by grouping vector

I have a data frame with a sequence of numeric columns, surrounded on both sides by (irrelevant) columns of characters. I want to obtain a new data frame that keeps the position of the irrelevant columns, and adds the numeric columns to each other according to a grouping vector (or applies some other row-wise function to the data frame, by group). Example:
sample = data.frame(cha1 = c("A","B"),num1=1:2,num2=3:4,num3=11:12,num4=13:14,cha2=c("C","D"))
> sample
  cha1 num1 num2 num3 num4 cha2
1    A    1    3   11   13    C
2    B    2    4   12   14    D
with the goal to obtain
> goal
  cha1 X1 X2 cha2
1    A  4 24    C
2    B  6 26    D
i.e. I've summed the 4 numeric columns according to the grouping vector gl(2,2,4) = (1,1,2,2) [levels: 1,2]
For a purely numeric data frame I've found the following method:
sample_num = sample[,2:5] #select numeric columns
data.frame(t(apply(sample_num,1,function(row) tapply(row, INDEX=gl(2,2,4),sum))))
I could combine this with re-inserting the character columns to give the intended result, but I'm really looking for a more elegant way. I'm particularly interested in a plyr method if there is one, as I'm trying to migrate to plyr for all my data frame manipulations. I imagine the first step would be to cast the data frame into long format, but I have no idea how to proceed from there.
One 'absolute' requirement is that I cannot do without the gl(n,k,l) method of grouping, as I need this to be applicable to a wide range of data frames and grouping factors.
EDIT: for simplicity assume that I know which columns are the relevant numeric columns. I'm not concerned with how to select them, I'm concerned with how to do my grouped sum without messing up the original data frame structure.
Thanks!
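One option is rowsum() on the transposed numeric columns, with the grouping vector as the group argument:
Grpindex<-gl(2,2,4)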
goal<-cbind.data.frame(sample["cha1"],(t(rowsum(t(sample[,2:5]), paste0("X",Grpindex)))),sample["cha2"])
Output:
  cha1 X1 X2 cha2
1    A  4 24    C
2    B  6 26    D
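Since this should work for a wide range of data frames and grouping factors, the rowsum() approach above could be wrapped in a small function. The helper sum_by_group is hypothetical (not from any package) and assumes the numeric columns form one contiguous block; the example data frame is called dat here to avoid masking base::sample.

```r
# Hypothetical helper: sum the columns at positions `cols` of `df` within the
# groups given by `grp`, leaving all other columns where they were.
# Assumes `cols` is one contiguous block of numeric columns.
sum_by_group <- function(df, cols, grp) {
  stopifnot(length(cols) == length(grp))
  # rowsum() sums rows by group, so transpose, sum, and transpose back;
  # reorder = FALSE keeps the groups in order of first appearance
  sums <- t(rowsum(t(as.matrix(df[, cols, drop = FALSE])),
                   paste0("X", grp), reorder = FALSE))
  before <- df[, seq_len(min(cols) - 1), drop = FALSE]
  after  <- df[, setdiff(seq_len(ncol(df)), seq_len(max(cols))), drop = FALSE]
  cbind.data.frame(before, sums, after)
}

dat <- data.frame(cha1 = c("A", "B"), num1 = 1:2, num2 = 3:4,
                  num3 = 11:12, num4 = 13:14, cha2 = c("C", "D"))
goal <- sum_by_group(dat, 2:5, gl(2, 2, 4))
goal
```

Any gl(n, k, l) factor whose length equals the number of selected columns can be passed as grp, so the grouping scheme stays outside the function.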

R select multiple rows by conditional row number

I have an R dataframe like this one:
a<-c(1,2,3,4,5)
b<-c(6,7,8,9,10)
df<-data.frame(a,b)
colnames(df)<-c("a","b")
df
  a  b
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10
I would like to get the 1st, 2nd, 3rd AND 5th row of the column a, so 1 2 3 5, by selecting rows by their number.
I have tried df$a[1:3,5] but I get Error in df$a[1:3, 5] : incorrect number of dimensions.
What DOES work is c(df$a[1:3],df$a[5]) but I was wondering if there was an easier way to achieve this with R?
Your data frame has two dimensions (rows and columns). When you use square brackets to extract values from it, R expects everything before the comma to indicate the desired rows and everything after the comma to indicate the desired columns (see ?"["). Hence, df[1:3,5] means rows 1 through 3 of column 5. (df$a on its own is a one-dimensional vector, which is why df$a[1:3,5] fails with "incorrect number of dimensions".) To select your desired rows in one go, you need to concatenate them into a single vector, i.e. c(1:3,5). That vector goes before the comma; the column indicator, 1 or "a", goes after the comma. Thus, df[c(1:3,5), 1] is what you need.
As an alternative (that might be more appropriate for a dataframe with many more columns), df[c(1:3, 5), "a"], as suggested by @Mamoun Benghezal, would also get it done!
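The variants side by side, as a short sketch on the data from the question (rows is just an illustrative name):

```r
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(6, 7, 8, 9, 10))
rows <- c(1:3, 5)

df$a[rows]                   # vector indexing: one index, no comma
df[rows, "a"]                # data frame indexing: rows before the comma, column after
df[rows, "a", drop = FALSE]  # same selection, kept as a one-column data frame
```

The first two forms return a plain vector; drop = FALSE is useful when the result should stay a data frame.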

List all possible occurrences within a column?

I am trying to merge a data.frame and a column from another data.frame, but have so far been unsuccessful.
My first data.frame [Frequencies] consists of 2 columns, containing 47 upper/lower case alpha characters and their frequency in a bigger data set. For example purposes:
Character<-c("A","a","B","b")
Frequency<-c(100,230,500,420)
The second data.frame [Sequences] is 93,000 rows in length and contains 2 columns, with the 47 same upper/ lower case alpha characters and a corresponding qualitative description. For example:
Character<-c("a","a","b","A")
Descriptor<-c("Fast","Fast","Slow","Stop")
I wish to add the descriptor column to the [Frequencies] data.frame, but not the 93,000 rows! Rather, what each "Character" represents. For example:
Character<-c("a")
Frequency<-c("230")
Descriptor<-c("Fast")
The following can also be done, where adf and bdf presumably stand for the [Frequencies] and [Sequences] data frames:
> merge(adf, bdf[!duplicated(bdf$Character),])
  Character Frequency Descriptor
1         a       230       Fast
2         A       100       Fast
3         b       420       Stop
4         B       500       Slow
Why not:
df1$Descriptor <- df2$Descriptor[ match(df1$Character, df2$Character) ]
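A runnable sketch of the match() approach on the example vectors from the question, assuming df1 is the [Frequencies] data frame and df2 the [Sequences] data frame:

```r
# Example data from the question; df1 and df2 are the names used in the answer above
df1 <- data.frame(Character = c("A", "a", "B", "b"),
                  Frequency = c(100, 230, 500, 420),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Character  = c("a", "a", "b", "A"),
                  Descriptor = c("Fast", "Fast", "Slow", "Stop"),
                  stringsAsFactors = FALSE)

# match() returns, for each df1$Character, the position of its first
# occurrence in df2$Character (NA when there is none)
df1$Descriptor <- df2$Descriptor[match(df1$Character, df2$Character)]
df1
```

Because match() only looks at the first occurrence, the 93,000 rows never get duplicated into df1. Unlike merge() with its default inner join, characters without any descriptor (here "B") stay in df1 with an NA descriptor.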
