Plot values that select a column based on a provided index value in R

I'm trying to figure out how to plot some values in a peculiar way. Say I have the example data below:
set.seed(100)
test.df <- as.data.frame(matrix(1:36,nrow=6))
test.df$V7 <- sample(1:6,6)
test.df$V8 <- seq(1:6)
colnames(test.df) <- c("col1","col2","col3","col4","col5","col6","index","id")
test.df
col1 col2 col3 col4 col5 col6 index id
1 1 7 13 19 25 31 2 1
2 2 8 14 20 26 32 6 2
3 3 9 15 21 27 33 3 3
4 4 10 16 22 28 34 1 4
5 5 11 17 23 29 35 4 5
6 6 12 18 24 30 36 5 6
I want to plot values from the first 6 columns, using the "index" column to select which column (1-6) to take the value from. This would be the y axis; the x axis would be "id". Essentially, the first y value would be 7, because index selects column 2 for the first row. The second y value would be 32, because the index value points to column 6.
Please let me know if I can clarify anything else. I'm fairly new to plotting in R (ggplot2 or otherwise), so any help is much appreciated.

This is not really a ggplot2 problem.
First, you can create a column `y`:
test.df[, "y"] <- 0
for (i in 1:nrow(test.df)) {
  test.df[i, "y"] <- test.df[i, paste0("col", test.df[i, "index"])]
}
Then you can plot it with plot():
plot(y ~ id, data = test.df, type = "l")
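For reference, a loop-free sketch does the same thing with matrix indexing (picking one of the first six columns per row according to index), and the plot can also be drawn with ggplot2 since the question mentions it:
# pick, for each row, the column named by "index" among the first six columns
test.df$y <- as.matrix(test.df[, 1:6])[cbind(seq_len(nrow(test.df)), test.df$index)]

library(ggplot2)
ggplot(test.df, aes(x = id, y = y)) + geom_line()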

Related

For Loop Adding Extra Rows to The Data Frame

Hello, I am very new to the programming world and data science, and I am trying to work my way through it.
I am trying to assign values to a column in a data frame using a for loop, so that the data frame is divided into ten groups and every row in a group gets the same rank: rows 1 to 10 get rank 1, rows 11 to 20 get rank 2, and so on. The original dimension of the subset data frame is 100 * 6.
My data frame looks like this: [screenshot of the data frame in the original post]
The code I have written is:
x <- round(nrow(subset) / 10)
a = 1
for (j in 1:10) {
  for (i in a:x) {
    subset[i, "rank"] = j
  }
  j = j + 1
  a = x + 1
  x = x * j
}
However, the loop runs infinitely and keeps on adding additional rows to the data frame. I had to manually stop the loop and the resulting dimension of the subset data frame was 17926 * 6.
Please help me understand where I am going wrong in writing the loop.
P.S. subset is a data frame name and not the subset function in R
Thanks in Advance !!
It might be better for you to start working with vectorized calculations instead of loops. This will help you in the future.
For example:
df <- data.frame(x = 1:100)
df$rank <- (df$x-1)%/%10 + 1
df
results in:
x rank
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 3
22 22 3
23 23 3
24 24 3
25 25 3
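Applied to the question's subset data frame (assuming its rows are still in their original order), the same idea would be:
subset$rank <- (seq_len(nrow(subset)) - 1) %/% 10 + 1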
How about something like this:
subset$Rank <- ceiling(as.numeric(rownames(subset))/10)
as.numeric() converts the row name into a number; dividing it by 10 and rounding up should give you what you need. Let me know if I've misunderstood.
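As a side note on why the original loop ran away: the for loop resets j at the start of each pass, so the manual j = j + 1 only feeds into x = x * j, which makes the inner loop's upper bound grow multiplicatively (10, 20, 60, 240, ...); assigning to row indices beyond nrow(subset) then silently extends the data frame. A minimal repair of the loop itself, assuming fixed groups of 10 rows, might look like this:
for (j in 1:10) {
  rows <- ((j - 1) * 10 + 1):(j * 10)
  subset[rows, "rank"] <- j
}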

Binning with quantiles adding exception in r

I need to create 10 bins with approximately equal frequencies; for this,
I am using the function classIntervals() from the classInt package with the style
'quantile'. This works for most columns, but when a column has one number repeated many times, an error appears saying that some breaks are not unique, which makes sense: more than 30% of the column is the same number, so the function doesn't know how to split the bins.
What I would like to do is: if a number's frequency is greater than 10% of the length of the column, treat that number as its own bin, and if not, use the function as it is.
For example, let's assume we have this DF:
df <- read.table(text="
X
1 5
2 29
3 4
4 26
5 4
6 17
7 4
8 4
9 4
10 25
11 4
12 4
13 5
14 14
15 18
16 13
17 29
18 4
19 13
20 6
21 26
22 11
23 2
24 23
25 4
26 21
27 7
28 4
29 18
30 4",h=T,strin=F)
So in this case the 10% of the length would be 3, so if we create a table containing the frequency of each number, it would appear something like this:
2 1
4 11
5 2
6 1
7 1
11 1
13 2
14 1
17 1
18 2
21 1
23 1
25 1
26 2
29 2
With this info, first we should treat "4" as a unique bin.
So we have a final output more or less like this:
X Bins
1 5 [2,6)
2 29 [27,30)
3 4 [4]
4 26 [26,27)
5 4 [4]
6 17 [15,19)
7 4 [4]
8 4 [4]
9 4 [4]
10 25 [19,26)
11 4 [4]
12 4 [4]
13 5 [2,6)
14 14 [12,15)
15 18 [15,19)
16 13 [12,15)
17 29 [27,30)
18 4 [4]
19 13 [12,15)
20 6 [6,12)
21 26 [26,27)
22 11 [6,12)
23 2 [2,6)
24 23 [19,26)
25 4 [4]
26 21 [19,26)
27 7 [6,12)
28 4 [4]
29 18 [15,19)
30 4 [4]
So far, my approach has been something like this:
Moda <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Binner <- function(df) {
  library(classInt)
  # Input is a data frame whose numeric columns should be binned
  for (c in 1:ncol(df)) {
    if (sapply(df, class)[c] == "numeric") {
      VectorTest <- df[, c]
      # Here I get 10% of the number of values
      TenPer <- floor(length(VectorTest) / 10)
      Counter <- 0
      while (sum(VectorTest == Moda(VectorTest)) >= TenPer) {
        # in this loop I manage to remove the values that are repeated more than
        # 10% of the time, but I still don't know how to add them as a special bin
        VectorTest <- VectorTest[VectorTest != Moda(VectorTest)]
        Counter <- Counter + 1
      }
      binsTest <- classIntervals(VectorTest, 10 - Counter, style = 'quantile')
      binsBrakets <- cut(VectorTest, breaks = binsTest$brks)
      df[, paste0("Binned_", colnames(df)[c])] <- binsBrakets
    }
  }
  return(df)
}
Can someone help me?
You could use cutr::smart_cut:
# devtools::install_github("moodymudskipper/cutr")
library(cutr)
df$Bins <- smart_cut(df$X,list(10,"balanced"),"g",simplify = F)
table(df$Bins)
#
# [2,4) [4,5) [5,6) [6,11) [11,14) [14,18) [18,21) [21,25) [25,29) [29,29]
# 1 11 2 2 3 2 2 2 3 2
more on cutr and smart_cut
You can create two data frames: one with the values that make up 10% or more of the column as their own bins, and one with the bins created by cut(). Then bind them together (make sure the bins are strings).
library(magrittr)

# lets find the numbers that appear more than 10% of the time
large <- table(df$X) %>%
  .[. >= length(df$X)/10] %>%
  names()

# these numbers appear less than 10% of the time
left_over <- df$X[!df$X %in% large]

# we want a total of 10 bins, so we'll cut the data into 10 - the number of 10% values
left_over_bins <- cut(left_over, 10 - length(large))

# Let's combine the information into a single data frame
numbers_bins <- rbind(
  data.frame(
    n = left_over,
    bins = left_over_bins %>% as.character,
    stringsAsFactors = F
  ),
  data.frame(
    n = df$X[df$X %in% large],
    bins = df$X[df$X %in% large] %>% as.character,
    stringsAsFactors = F
  )
)
If you table the information you'll get something like this
table(numbers_bins$bins) %>% sort(T)
4 (1.97,5] (11,14] (23,26] (17,20]
11 3 3 3 2
(20,23] (26,29] (5,8] (14,17] (8,11]
2 2 2 1 1
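If you would rather stay close to the classIntervals() approach from the question, here is a rough sketch of the same idea; the helper name bin_with_exception is just illustrative, and quantile breaks can still collide if the remaining values are heavily tied:
library(classInt)

bin_with_exception <- function(x, n_bins = 10) {
  counts <- table(x)
  # values that make up at least 10% of the column become their own bin
  big  <- as.numeric(names(counts)[counts >= length(x) / 10])
  rest <- x[!x %in% big]
  # quantile bins for the remaining values, one fewer bin per special value
  brks <- classIntervals(rest, n_bins - length(big), style = "quantile")$brks
  labs <- as.character(cut(x, breaks = brks, include.lowest = TRUE))
  ifelse(x %in% big, paste0("[", x, "]"), labs)
}

df$Bins <- bin_with_exception(df$X)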

fill missing values with value from previous column

I have a data.frame with some columns that contain missing values, and I want the missing values to be filled in with the value from the previous column. For example:
country <- c('a','b','c')
yr01 <- c(15,16,7)
yr02 <- c(NA,18,NA)
yr03 <- c(20,22,NA)
yr04 <- c(15,18,19)
tab <- data.frame(country,yr01,yr02,yr03,yr04)
tab
country yr01 yr02 yr03 yr04
1 a 15 NA 20 15
2 b 16 18 22 18
3 c 7 NA NA 19
How can I make it so that each NA is replaced by the previous column's value? For example, in country a, column yr02 would equal 15, and in country c, columns yr02 and yr03 would be 7. Thanks!
It's usually easier to work with columns, but we can apply to rows the standard answer from the R-FAQ Replace NAs with latest non-NA value.
tab[-1] = t(apply(tab[-1], 1, zoo::na.locf))
tab
# country yr01 yr02 yr03 yr04
# 1 a 15 15 20 15
# 2 b 16 18 22 18
# 3 c 7 7 7 19
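A hedged alternative with tidyr (assuming the tidyverse packages are available) is to pivot to long format, fill downwards within each country, and pivot back:
library(dplyr)
library(tidyr)

tab %>%
  pivot_longer(-country, names_to = "year", values_to = "value") %>%
  group_by(country) %>%
  fill(value, .direction = "down") %>%   # carry the last non-NA value forward
  ungroup() %>%
  pivot_wider(names_from = year, values_from = value)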

Concatenate data frames together based on similar column values

Specifically, say I had three data frames d1, d2, d3:
d1:
X Y Z value
1 0 20 135 43
2 0 4 105 50
3 5 18 20 10
...
d2:
X Y Z value
1 0 20 135 15
2 0 4 105 14
3 2 9 12 16
...
d3:
X Y Z value
1 0 20 135 29
2 2 9 14 16
...
I want to be able to combine these data frames such that each row of the combined data frame consists of three values, based on all unique X, Y, Z combinations. If such an X, Y, Z combination does not exist in one of the original data frames then I just want it to have a value of null (or some arbitrarily low number if that isn't possible). So I'd want an output of:
dfinal:
X Y Z value1 value2 value3
1 0 20 135 43 15 29
2 0 4 105 50 14 null
3 5 18 20 10 null null
4 2 9 12 null 16 null
5 2 9 14 null null 16
...
Is there any efficient way of doing this? I've tried doing this instead using data.table which seemed more suited for this but have yet to figure out how.
?merge
Should do the trick?
By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y.
So:
merge(d1,d2, by=c("X","Y","Z"))
And you can include all = TRUE to keep all rows; the missing data will be NA:
merge(d1,d2, by=c("X","Y","Z"), all=TRUE)
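For all three frames at once, merge() can be folded over a list; this is a sketch that first renames each value column so they don't clash:
dfs <- list(d1, d2, d3)
dfs <- Map(function(d, i) setNames(d, sub("^value$", paste0("value", i), names(d))),
           dfs, seq_along(dfs))
dfinal <- Reduce(function(a, b) merge(a, b, by = c("X", "Y", "Z"), all = TRUE), dfs)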
Take a look at dplyr and its join methods. I wrote a small example:
library(dplyr)
library(data.table)
d1 <- data.table(X = c(1,2,3), Y = c(2,3,4), Z = c(8,3,9), value = c(22,3,44))
d2 <- data.table(X = c(1,4,3), Y = c(2,6,4), Z = c(8,9,9), value = c(44,22,11))
d2 <- rename(d2, value2 = value)
full_join(d1,d2)
output:
X Y Z value value2
1 1 2 8 22 44
2 2 3 3 3 NA
3 3 4 9 44 11
4 4 6 9 NA 22
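With a third table prepared the same way (its value column renamed, e.g. to value3), the joins simply chain: full_join(full_join(d1, d2), d3).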

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for the normal transformation of a variable, which I need to apply to columns w, y, and z:
y <- qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))
For example, if I wanted to run this function on column w and append the output column to data frame m:
m$w_n <- qnorm((rank(m$w, na.last = "keep") - 0.5) / sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame; the one I have is much larger, and there are more columns besides w, y, z that I need to run this function on.
Thanks!
There is probably a way to do it in a single step, but what about this:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 -0.2104284 -1.3829941
2 2 20 26 1.3829941 1.3829941
3 3 43 11 0.2104284 0.6744898
4 4 4 37 -1.3829941 0.2104284
5 5 36 24 0.6744898 -0.6744898
6 6 27 14 -0.6744898 -0.2104284
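For the question's data frame m, a hedged alternative with dplyr (version 1.0 or later) applies the same transCols function to several columns in one mutate() call:
library(dplyr)

m_n <- m %>%
  mutate(across(c(w, y, z), transCols, .names = "{.col}_n"))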
