I asked an earlier question here, "Using variations of `apply` in R", and now I have an extension to it. Some of my 40 variables are categorical, and I need the number of observations for each unique category. I would like to use some form of apply because I have been using sapply and tapply in various parts of this code, but it is not required. Here is a bit of the data:
Age Wt Ht Type Color Width
79 134 66 C red small
67 199 64 C green small
39 135 78 T yellow small
92 149 61 C yellow medium
33 138 75 T green medium
68 139 71 C yellow medium
95 198 62 T red large
65 132 65 T blue large
56 138 81 C green large
71 193 78 T blue large
The result for the last two columns should look like this:
C T
red 1 1
green 2 1
blue 0 2
yellow 2 1
small 2 1
medium 2 1
large 1 3
Also, I know I could use 'table', but how do I send multiple variables one at a time against Type in order to get it to look something like this? Using table as opposed to apply is fine with me.
Thanks!
We can use table after unlisting the 'Color' and 'Width' columns and replicating the 'Type'.
Un1 <- unlist(df1[5:6])
Un2 <- df1$Type[row(df1[5:6])]
If we need a custom order, convert to factor and specify the levels in that order.
table(factor(Un1, levels = c("red", "green", "blue", "yellow", "small",
"medium", "large")), Un2)
# Un2
# C T
# red 1 1
# green 2 1
# blue 0 2
# yellow 2 1
# small 2 1
# medium 2 1
# large 1 3
Or if the order is based on the order of appearance of unique elements in each of the columns
table(factor(Un1, levels = unique(Un1)), Un2)
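Since the question mentions the apply family, here is a minimal sketch of an lapply alternative that builds one table per categorical column and stacks them. The data frame df1 is rebuilt from the sample rows in the question (only the three relevant columns), and the names tabs and res are just illustrative. Note that the rows come out in alphabetical order within each variable, not in a custom order.

```r
# Sample data from the question (only the columns needed here)
df1 <- data.frame(
  Type  = c("C", "C", "T", "C", "T", "C", "T", "T", "C", "T"),
  Color = c("red", "green", "yellow", "yellow", "green",
            "yellow", "red", "blue", "green", "blue"),
  Width = c("small", "small", "small", "medium", "medium",
            "medium", "large", "large", "large", "large"),
  stringsAsFactors = FALSE
)

# One contingency table per categorical column, each crossed with Type
tabs <- lapply(df1[c("Color", "Width")], function(x) table(x, df1$Type))

# Stack the per-column tables into a single matrix of counts
res <- do.call(rbind, tabs)
res
```

With 40 variables, the column selection could be replaced by whichever names or positions hold the categorical columns.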
I have a data cleaning/transformation problem which I've solved in a way which I'm 1,000% sure could have been solved much more simply.
Below is an example of what my data looks like initially. The first four columns are numbers I'll use for a lookup, the next is the type of the item, and the last two columns are the ones I want to fill. Based on the value of the column type, I would like to fill in the value_one and value_two columns with the values of the same-numbered column of the matching type: either one_apple and two_apple, or one_orange and two_orange. For example, if the type in a row is "apple", I would like to fill value_one with the value of one_apple for that row, and value_two with the value of two_apple from that row.
one_apple one_orange two_apple two_orange type value_one value_two
1 23 56 90 orange NA NA
2 24 57 91 orange NA NA
3 25 58 92 apple NA NA
4 26 59 93 apple NA NA
5 27 60 94 orange NA NA
6 28 61 95 apple NA NA
...
This is what I would like that dataframe to look like after I run my code:
one_apple one_orange two_apple two_orange type value_one value_two
1 23 56 90 apple 1 56
2 24 57 91 orange 24 91
3 25 58 92 apple 3 58
4 26 59 93 apple 4 59
5 27 60 94 apple 5 60
6 28 61 95 apple 6 61
...
The way I have solved this right now is to use a for loop, which figures out the index of the columns matching the type value in that row, which(str_sub(names(example_data), start = 5) == example_data$type[i]). Then I use that index to select the correct value for the value_one column from the appropriate place, example_data[i,...)[1]] and assign it to value_one. I do the same thing for value_two.
Below I have code which first creates an example dataset like the one I want to transform, and then shows my for loop running on it to transform the data.
library(stringr)  # provides str_sub(), used in the loop below
example_data = data.frame(one_apple = 1:(1+30), one_orange = 23:(23+30), two_apple = 56:(56+30), two_orange = 90:(90+30), type = sample(c("apple","orange"), 31, replace = TRUE), value_one = rep(NA,31), value_two = rep(NA,31))
for(i in 1:nrow(example_data)){
example_data$value_one[i] = example_data[i,which(str_sub(names(example_data), start = 5) == example_data$type[i])[1]]
example_data$value_two[i] = example_data[i,which(str_sub(names(example_data), start = 5) == example_data$type[i])[2]]
}
This transformation works, but it is clearly not great code and I feel like I'm missing an easier way to do it with apply and without the convoluted use of which to grab column indexes and stuff. It would be very helpful to see a better way to do this.
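One possible vectorised approach, offered as a sketch and not as the definitive answer: build the target column name from the prefix and the type, then pick one cell per row with matrix indexing. The helper name pick_by_type is hypothetical, set.seed() is only there to make the example reproducible, and the code assumes the column names follow the prefix_type pattern shown above.

```r
set.seed(42)  # assumption: only for a reproducible example
n <- 31
example_data <- data.frame(
  one_apple  = 1:n,
  one_orange = 23:(22 + n),
  two_apple  = 56:(55 + n),
  two_orange = 90:(89 + n),
  type = sample(c("apple", "orange"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, return the value from the column
# named "<prefix>_<type>", e.g. "one_apple" when type is "apple"
pick_by_type <- function(data, prefix) {
  value_cols <- c("one_apple", "one_orange", "two_apple", "two_orange")
  lookup <- as.matrix(data[value_cols])
  cols <- match(paste(prefix, data$type, sep = "_"), colnames(lookup))
  # A two-column (row, column) index matrix selects one cell per row
  lookup[cbind(seq_len(nrow(data)), cols)]
}

example_data$value_one <- pick_by_type(example_data, "one")
example_data$value_two <- pick_by_type(example_data, "two")
head(example_data)
```

With only two types, a plain ifelse(type == "apple", one_apple, one_orange) would also do; the matrix-indexing version is meant for the case where more types are added later.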
Household Size 0 1 2 3 4 5+
Bedrooms Bedrooms Bedrooms Bedrooms Bedrooms Bedrooms
1 253 4486 2033 930 105 8
2 10 666 3703 947 85 7
3 4 68 1972 1621 52 5
4 1 12 680 1835 164 11
5+ 0 6 147 1230 721 122
I have the above dataframe where 'Bedrooms' is a label on the columns.
I'm trying to change this into a data table I can then use within rmarkdown to add into a flexdashboard. When I use the below code:
DT::datatable(df, rownames = FALSE, extensions = 'FixedColumns', escape=TRUE,options= list(bPaginate = FALSE))
I get the output:
Household Size 0 1 2 3 4 5+
1 253 4486 2033 930 105 8
2 10 666 3703 947 85 7
3 4 68 1972 1621 52 5
4 1 12 680 1835 164 11
5+ 0 6 147 1230 721 122
I have a few problems with this:
The labels that say 'Bedrooms' don't show, so there's no way of knowing what the numbers in the columns actually mean. I'd like to include the labels, or have a row on top of the column names that says "Number of Bedrooms" and spans all of the bedroom columns.
The columns Household Size and 5+ are wider than the rest of the columns. I want these to either be the same width, or Household Size to be slightly bigger than the rest.
I think it's worth noting that the row 5+ and the column 5+ are both a new row/column that count any value above 5.
Also, this is just an extra but I'd like to colour the bottom left cells red and the top right cells green, is this possible?
I've figured out how to keep 'Bedrooms' in the column titles. It's possible to set the column names within DT::datatable using the code below;
DT::datatable(HS_BED_ALL, rownames = FALSE, colnames=c('Household Size','0 Bedrooms','1 Bedroom','2 Bedrooms','3 Bedrooms','4 Bedrooms','5+ Bedrooms'), extensions = 'FixedColumns', escape=TRUE, options= list(bPaginate = FALSE, dom = 't',buttons = c('excel')))%>%formatStyle(1:7,fontSize = '14px')
Which gives the desired output.
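For the other idea in the question, a header row above the column names, DT accepts a custom table container. The following is a minimal sketch assuming the DT and htmltools packages are installed; the data frame is rebuilt from the table in the question, and the column names hh_size and b0 to b5plus as well as the object names sketch and widget are made up for illustration.

```r
library(DT)
library(htmltools)

# Data rebuilt from the question; column names here are illustrative
HS_BED_ALL <- data.frame(
  hh_size = c("1", "2", "3", "4", "5+"),
  b0      = c(253, 10, 4, 1, 0),
  b1      = c(4486, 666, 68, 12, 6),
  b2      = c(2033, 3703, 1972, 680, 147),
  b3      = c(930, 947, 1621, 1835, 1230),
  b4      = c(105, 85, 52, 164, 721),
  b5plus  = c(8, 7, 5, 11, 122)
)

# Two header rows: a spanning "Number of Bedrooms" cell above 0 ... 5+
sketch <- withTags(table(
  class = "display",
  thead(
    tr(
      th(rowspan = 2, "Household Size"),
      th(colspan = 6, "Number of Bedrooms")
    ),
    tr(lapply(c("0", "1", "2", "3", "4", "5+"), th))
  )
))

widget <- datatable(HS_BED_ALL, container = sketch, rownames = FALSE,
                    options = list(paging = FALSE, dom = "t"))
widget
```

The number of th cells in the second header row has to match the number of data columns under the spanning cell, otherwise the table will not render correctly.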
I'm having a problem with nested for loops and ifelse statements. This is my dataframe abund:
Species Total C1 C2 C3 C4
1 Blue 223 73 30 70 50
2 Black 221 17 50 56 98
3 Yellow 227 29 99 74 25
4 Green 236 41 97 68 30
5 Red 224 82 55 21 66
6 Orange 284 69 48 73 94
7 Black 154 9 63 20 62
8 Red 171 70 58 13 30
9 Blue 177 57 27 8 85
10 Orange 197 88 61 18 30
11 Orange 112 60 8 31 13
I would like to add together some of abund’s columns but only if they match the correct species I’ve specified in the vector colors.
colors <- c("Black", "Red", "Blue")
So, if the Species in abund matches a species in colors, then add columns C2 through C4 together in a new vector minus. If the Species in abund does not match any species in colors, then add a 0 to the new vector minus.
I'm having trouble with my code and hope it's just a small matter of defining a range, but I'm not sure. This is my code so far:
# Use for loop to create vector of sums for select species or 0 for species not selected
for( i in abund$Species)
{
for( j in colors)
{
minus <- ifelse(i == j, sum(abund[abund$Species == i,
"C2"]:abund[abund$Species == i, "C4"]), 0)
}
}
Which returns this: There were 12 warnings (use warnings() to see them)
and this "vector": minus [1] 0
This is my target:
minus
[1] 150 204 0 0 142 0 145 101 120 0 0
Thank you for your time and help with this.
This is probably better done without any loops.
# Create the vector
minus <- rep(0, nrow(abund))
# Identify the "colors" cases
inColors <- abund[["Species"]] %in% colors
# Set the values
minus[inColors] <- rowSums(abund[inColors, c("C2","C3","C4")])
Also, for what it is worth, there are quite a few problems with your original code. First, your outer for loop isn't doing what you think. In each round, the value of i is set to the next value in abund$Species, so first it is Blue, then Black, then Yellow, etc. As a result, when you index using abund[abund$Species == i, ], you may get multiple rows back (e.g. Blue gives you rows 1 and 9, since Species == "Blue" in both of those rows).
Second, when you write abund[abund$Species == i, "C2"]:abund[abund$Species == i, "C4"], you are not indexing the columns C2, C3 and C4; you are making a sequence starting at the value in C2 and ending at the value in C4. For example, when i == "Yellow" it returns 99:25, i.e. 99, 98, 97, ..., 26, 25. The warnings came from a combination of this problem and the previous one. For example, when i == "Blue", you were trying to make a sequence starting at both 30 and 27 and ending at both 50 and 85. The warning was saying that only the first number of the start and of the end was used, giving you 30:50.
Finally, you were constantly overwriting your value of minus instead of filling it in element by element. You need to first create minus as above and then index into it for the assignment, like this: minus[i] <- newValue, where i is a row position and not a species name.
Note that ifelse is vectorized so you usually don't need any for loops when using it.
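To make the points above concrete, here is a sketch of what a corrected loop could look like if a loop is really wanted. The data frame is rebuilt from the question; looping over row positions with seq_len() and filling a pre-allocated vector are my choices for illustration, not the only way to repair the original code.

```r
abund <- data.frame(
  Species = c("Blue", "Black", "Yellow", "Green", "Red", "Orange",
              "Black", "Red", "Blue", "Orange", "Orange"),
  Total = c(223, 221, 227, 236, 224, 284, 154, 171, 177, 197, 112),
  C1 = c(73, 17, 29, 41, 82, 69, 9, 70, 57, 88, 60),
  C2 = c(30, 50, 99, 97, 55, 48, 63, 58, 27, 61, 8),
  C3 = c(70, 56, 74, 68, 21, 73, 20, 13, 8, 18, 31),
  C4 = c(50, 98, 25, 30, 66, 94, 62, 30, 85, 30, 13),
  stringsAsFactors = FALSE
)
colors <- c("Black", "Red", "Blue")

# Pre-allocate the result so each iteration fills exactly one element
minus <- numeric(nrow(abund))

# Loop over row positions, not over species names
for (i in seq_len(nrow(abund))) {
  if (abund$Species[i] %in% colors) {
    # Select the three columns by name; "C2":"C4" style ranges do not exist
    minus[i] <- sum(abund[i, c("C2", "C3", "C4")])
  }
}
minus
```

The vectorised versions in this thread give the same vector without the loop and are the better choice in practice.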
I like Barker's answer best, but if you wanted to do this with ifelse this is the way:
abund$minus = with(abund, ifelse(
Species %in% colors, # if the species matches
C2 + C3 + C4, # add the columns
0 # otherwise 0
))
Even though this is just one line and Barker's is 3, on large data it will be slightly more efficient to avoid ifelse.
However, ifelse statements can be nested and are often easier to work with when conditions get complicated - so there are definitely good times to use them. On small to medium sized data the speed difference will be negligible so just use whichever you think of first.
# Create a column called minus with the length of the number of existing rows.
# The default value is zero.
abund$minus <- integer(nrow(abund))
# Perform sum of C2 to C4 only in those rows where Species is in the colors vector
abund$minus[abund$Species %in% colors] <- rowSums(abund[abund$Species %in% colors, c("C2", "C3", "C4")])
I am new to R. I have a data frame like following
>df=data.frame(Id=c("Entry_1","Entry_1","Entry_1","Entry_2","Entry_2","Entry_2","Entry_3","Entry_4","Entry_4","Entry_4","Entry_4"),Start=c(20,20,20,37,37,37,68,10,10,10,10),End=c(50,50,50,78,78,78,200,94,94,94,94),Pos=c(14,34,21,50,18,70,101,35,2,56,67),Hits=c(12,34,17,89,45,87,1,5,6,3,26))
Id Start End Pos Hits
Entry_1 20 50 14 12
Entry_1 20 50 34 34
Entry_1 20 50 21 17
Entry_2 37 78 50 89
Entry_2 37 78 18 45
Entry_2 37 78 70 87
Entry_3 68 200 101 1
Entry_4 10 94 35 5
Entry_4 10 94 2 6
Entry_4 10 94 56 3
Entry_4 10 94 67 26
For each entry I would like to iterate over the data.frame in 3 different modes. For example, for Entry_1, mode_1 = seq(20,50,3), mode_2 = seq(21,50,3) and mode_3 = seq(22,50,3). I would like to sum all the values in column "Hits" whose corresponding values in column "Pos" fall in mode_1, mode_2 or mode_3, and generate a data.frame like the following:
Id Mode_1 Mode_2 Mode_3
Entry_1 0 17 34
Entry_2 87 89 0
Entry_3 1 0 0
Entry_4 26 8 0
I tried the following code:
mode_1=0
mode_2=0
mode_3=0
mode_1_sum=0
mode_2_sum=0
mode_3_sum=0
for(i in dim(df)[1])
{
if(df$Pos[i] %in% seq(df$Start[i],df$End[i],3))
{
mode_1_sum=mode_1_sum+df$Hits[i]
print(mode_1_sum)
}
mode_1=mode_1_sum+counts
print(mode_1)
ifelse(df$Pos[i] %in% seq(df$Start[i]+1,df$End[i],3))
{
mode_2_sum=mode_2_sum+df$Hits[i]
print(mode_2_sum)
}
mode_2_sum=mode_2_sum+counts
print(mode_2)
ifelse(df$Pos[i] %in% seq(df$Start[i]+2,df$End[i],3))
{
mode_3_sum=mode_3_sum+df$Hits[i]
print(mode_3_sum)
}
mode_3_sum=mode_3_sum+counts
print(mode_3_sum)
}
But the above code only prints 26. Can anyone guide me on how to generate my desired output, please? I can provide more details if needed. Thanks in advance.
It's not an elegant solution, but it works.
m <- 3 # Number of modes you want
foo <- ((df$Pos - df$Start)%%m + 1) * (df$Start < df$Pos) * (df$End > df$Pos)
tab <- matrix(0,nrow(df),m)
for(i in 1:m) tab[foo==i,i] <- df$Hits[foo==i]
aggregate(tab,list(df$Id),FUN=sum)
# Group.1 V1 V2 V3
# 1 Entry_1 0 17 34
# 2 Entry_2 87 89 0
# 3 Entry_3 1 0 0
# 4 Entry_4 26 8 0
-- EXPLANATION --
First, we find the indices of df$Pos that are both bigger than df$Start and smaller than df$End. These comparisons give 1 if TRUE and 0 if FALSE. Next, we take the difference between df$Pos and df$Start, take it mod 3 (which gives a vector of 0s, 1s and 2s), and then add 1 to get the right mode. We multiply these things together, so that the values that fall within the interval retain the right mode, and the values that fall outside the interval become 0.
Next, we create an empty matrix that will contain the values. Then, we use a for-loop to fill in the matrix. Finally, we aggregate the matrix.
I tried looking for a quicker solution, but the main problem I cannot work around is the varying intervals for each row.
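The same idea can be written without the intermediate matrix by cross-tabulating with xtabs. This is only a sketch under the same rule as above (a position counts only if it lies strictly between Start and End); the column name Mode and the objects in_range and res are names I made up for the example.

```r
df <- data.frame(
  Id = c("Entry_1", "Entry_1", "Entry_1", "Entry_2", "Entry_2", "Entry_2",
         "Entry_3", "Entry_4", "Entry_4", "Entry_4", "Entry_4"),
  Start = c(20, 20, 20, 37, 37, 37, 68, 10, 10, 10, 10),
  End   = c(50, 50, 50, 78, 78, 78, 200, 94, 94, 94, 94),
  Pos   = c(14, 34, 21, 50, 18, 70, 101, 35, 2, 56, 67),
  Hits  = c(12, 34, 17, 89, 45, 87, 1, 5, 6, 3, 26)
)

m <- 3  # number of modes

# Rows whose position lies strictly inside the interval
in_range <- df$Pos > df$Start & df$Pos < df$End

# Mode 1, 2 or 3 depending on the offset from Start, modulo m
df$Mode <- factor((df$Pos - df$Start) %% m + 1, levels = 1:m)

# Sum Hits for every Id / Mode combination, using in-range rows only
res <- xtabs(Hits ~ Id + Mode, data = df[in_range, ])
res
```

One caveat with this sketch: an Id whose rows are all out of range would drop out of the result unless Id is converted to a factor before subsetting.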
I am trying to set up a bar chart to compare control and experimental samples taken of specific compounds. The data set is known as 'hydrocarbon3' and contains the following information:
Exp. Contr.
c12 89 49
c17 79 30
c26 78 35
c42 63 3
pris 0.5 0.8
phy 0.5 0.9
nap 87 48
nap1 83 44
nap2 78 44
nap3 73 20
acen1 81 50
acen2 86 46
fluor 83 11
fluor1 68 13
fluor2 79 17
dibe 65 7
dibe1 67 6
dibe2 56 10
phen 82 13
phen1 70 12
phen2 65 15
phen3 53 14
fluro 62 9
pyren 48 11
pyren1 34 10
pyren2 19 8
chrys 22 3
chrys1 21 3
chrys2 21 3
When I create a bar chart with the formula:
barplot(as.matrix(hydrocarbon3),
main=c("Fig 1. Change in concentrations of different hydrocarbon compounds\nin sediments with and without the presence of bacteria after 21 days"),
beside=TRUE,
xlab="Oiled sediment samples collected at 21 days",
space=c(0,2),
ylab="% loss in concentration relative to day 0")
I receive this diagram. However, I need the control and experimental samples of each chemical to be next to each other to allow a more accurate comparison, rather than the experimental samples bunched on the left and the control samples bunched on the right. Is there a way to correct this in R?
Try transposing your matrix:
barplot(t(as.matrix(hydrocarbon3)), beside=TRUE)
Basically, barplot will plot things in the order they show up in the matrix, which, since a matrix is just a vector wrapped colwise, means barplot will plot all the values of the first column, then all those of the second column, etc.
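A small sketch of the difference, using only the first four compounds from the question (the matrix name hc and the mids_* names are mine). With beside = TRUE, barplot returns the bar midpoints as a matrix with one column per group, so its dimensions show how the bars were grouped.

```r
# First four rows of the question's data, filled column by column
hc <- matrix(
  c(89, 79, 78, 63,   # Exp.
    49, 30, 35, 3),   # Contr.
  ncol = 2,
  dimnames = list(c("c12", "c17", "c26", "c42"), c("Exp.", "Contr."))
)

# Original orientation: 2 groups (Exp., Contr.) with 4 bars each
mids_by_treatment <- barplot(hc, beside = TRUE)
dim(mids_by_treatment)

# Transposed: 4 groups (one per compound) with an Exp. and a Contr. bar each
mids_by_compound <- barplot(t(hc), beside = TRUE, legend.text = TRUE)
dim(mids_by_compound)
```

The legend.text = TRUE argument labels the two bars in each group using the row names of the transposed matrix.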
Check this question out: Barplot with 2 variables side by side
It uses ggplot2, so you'll have to use the following code before running it:
install.packages("ggplot2")
library(ggplot2)
Hopefully this works for you. Plus it looks a little nicer with ggplot2!
> df
row exp con
1 a 1 2
2 b 2 3
3 c 3 4
> barplot(rbind(df$exp,df$con),
+ beside = TRUE,names.arg=df$row)
produces: