Two columns from a single column's values in R

I have a single column of data and want to convert it into two columns:
beta
2
.002
52
.06
61
0.09
70
0.12
85
0.92
I want it split into two columns like this:
col1 col2
2 0.002
52 0.06
61 0.09
70 0.12
85 0.92
Can anyone please help me sort this out?

We can unlist the data frame and convert it into a matrix with nrow/2 rows, filled row by row
data.frame(matrix(unlist(df), nrow = nrow(df)/2, byrow = TRUE))
# X1 X2
#1 2 0.002
#2 52 0.060
#3 61 0.090
#4 70 0.120
#5 85 0.920

We can do a logical index and create two columns
i1 <- c(TRUE, FALSE)
df2 <- data.frame(col1 = df1$beta[i1], col2 = df1$beta[!i1])
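As a self-contained check of the matrix idea above, here is a minimal sketch. It assumes the single column is named beta (as in the question) and has an even number of rows; the object names beta_df and two_col are made up for illustration.

```r
# Hypothetical reconstruction of the question's data: one column, values alternating
beta_df <- data.frame(beta = c(2, 0.002, 52, 0.06, 61, 0.09, 70, 0.12, 85, 0.92))

# Pairing only makes sense with an even number of rows
stopifnot(nrow(beta_df) %% 2 == 0)

# Fill a two-column matrix row by row, so consecutive values share a row
two_col <- as.data.frame(
  matrix(beta_df$beta, ncol = 2, byrow = TRUE,
         dimnames = list(NULL, c("col1", "col2")))
)
two_col
```

Naming the columns through dimnames avoids the default X1/X2 names and a separate renaming step.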

Related

How do I use for loop to find unique values in all columns in a Dataframe

I want to find out the unique values in every column in the dataframe using a for loop. Using names(df) stores the column names to a character datatype, which doesn't work in this case.
This may be what you're looking for:
set.seed(123)
df <- data.frame(a = sample(1:100, 20),
b = sample(LETTERS, 20),
c = round(runif(20),2))
for(i in colnames(df)){
cat("Unique values in", i, ":", unique(df[,i]), "\n")
}
Output:
#Unique values in a : 31 79 51 14 67 42 50 43 97 25 90 69 57 9 72 26 7 95 87 36
#Unique values in b : N Q K G U L O J M W I P S Y X E R V C F
#Unique values in c : 0.75 0.9 0.37 0.67 0.09 0.38 0.27 0.81 0.45 0.79 0.44 0.63 0.71 0 0.48 0.22
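If the unique values are needed for further work rather than just printed, a loop-free variant may be handier. This is only a sketch on a small made-up data frame (df_demo and uniq are illustrative names, not from the question); it relies on a data frame being a list of columns, so lapply visits each column in turn.

```r
# Small illustrative data frame (not the data from the question)
df_demo <- data.frame(a = c(1, 2, 2, 3),
                      b = c("x", "x", "y", "y"),
                      stringsAsFactors = FALSE)

# One list element per column, each holding that column's unique values
uniq <- lapply(df_demo, unique)

# Access the result by column name
uniq$a
uniq$b
```

The result is a named list, so columns with different numbers of unique values are not a problem.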

Apply two different formulas on four data frame columns

I want to apply two different formulas on four columns of my dataframe df. I have done this manually, but since my original data frame has several columns, I want to be able to use loops or case when to do this faster.
Here's what the sample data frame df looks like:
A B C D
20 100 4 1200
40 150 6 2300
34 200 3 1230
32 225 9 1100
12 220 10 1000
Formula 1:
(x-min(x))/(max(x)-min(x))
Formula 2:
(min(x)-x)/(max(x)-min(x))
I'd like to apply formula 1 on columns B and D and formula 2 on columns A and C.
After applying the formula, I want to store the values in a different dataframe but with the same column names.
Here's what I did:
formula_1 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
formula_2 <- function(x) {
  (min(x) - x) / (max(x) - min(x))
}
# Create an empty data frame BI_score with one row per row of df
BI_score <- data.frame(row.names = seq_len(nrow(df)))
BI_score$B <- formula_1(df$B)
BI_score$D <- formula_1(df$D)
BI_score$A <- formula_2(df$A)
BI_score$C <- formula_2(df$C)
EDIT
If the data contain NA or Inf values that should be excluded from the calculation, we can update the functions as below and then apply them to the columns in the same way as before.
formula_1 <-function(x) {
temp <- x[is.finite(x)]
replace(x, is.finite(x), (((temp - min(temp)))/(max(temp) - min(temp))))
}
formula_2 <-function(x) {
temp <- x[is.finite(x)]
replace(x, is.finite(x), (min(temp)-temp)/(max(temp)-min(temp)))
}
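To see what the finite-only version does, here is a small sketch on a made-up vector (the input x_demo is an assumption for illustration). Non-finite entries are left in place, and the scaling uses only the finite values.

```r
# Same idea as the updated formula_1: scale only the finite entries
formula_1 <- function(x) {
  finite <- is.finite(x)
  temp <- x[finite]
  replace(x, finite, (temp - min(temp)) / (max(temp) - min(temp)))
}

# Illustrative input with a missing and an infinite value
x_demo <- c(100, NA, 150, Inf, 200)
scaled <- formula_1(x_demo)
scaled
# The finite values 100, 150, 200 become 0, 0.5, 1; NA and Inf stay where they were
```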
The most straightforward approach would be to use lapply to apply each function separately to the selected columns.
BI_score <- df
fm1_cols <- c("B", "D")
fm2_cols <- c("A", "C")
BI_score[fm1_cols] <- lapply(df[fm1_cols], formula_1)
BI_score[fm2_cols] <- lapply(df[fm2_cols], formula_2)
BI_score
# A B C D
#1 -0.29 0.00 -0.14 0.154
#2 -1.00 0.40 -0.43 1.000
#3 -0.79 0.80 0.00 0.177
#4 -0.71 1.00 -0.86 0.077
#5 0.00 0.96 -1.00 0.000
As mentioned by @Sotos, if you want to apply the functions to alternating columns, you could do
# odd-numbered columns (A, C) get formula_2, even-numbered columns (B, D) get formula_1
BI_score[c(TRUE, FALSE)] <- lapply(df[c(TRUE, FALSE)], formula_2)
BI_score[c(FALSE, TRUE)] <- lapply(df[c(FALSE, TRUE)], formula_1)
Just for fun, an approach using dplyr
library(dplyr)
# all_of() selects the columns named in a character vector
bind_cols(df %>% select(all_of(fm1_cols)) %>% mutate_all(formula_1),
          df %>% select(all_of(fm2_cols)) %>% mutate_all(formula_2))
If your goal is to apply the two functions on alternating columns, then you can do it via logical indexing
cbind.data.frame(sapply(df[c(TRUE, FALSE)], formula_2),
sapply(df[c(FALSE, TRUE)], formula_1))
# A C B D
#1 -0.2857143 -0.1428571 0.00 0.15384615
#2 -1.0000000 -0.4285714 0.40 1.00000000
#3 -0.7857143 0.0000000 0.80 0.17692308
#4 -0.7142857 -0.8571429 1.00 0.07692308
#5 0.0000000 -1.0000000 0.96 0.00000000
We can use mutate_at from dplyr
library(dplyr)
df1 %>%
mutate_at(vars(B, D), formula_1) %>%
mutate_at(vars(A, C), formula_2)
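When there are many columns, one option is to keep the column-to-formula mapping in a single named list and apply it in one pass. The sketch below uses base R only; the list name rules is hypothetical, and the data are the sample values from the question.

```r
# Sample data from the question
df <- data.frame(A = c(20, 40, 34, 32, 12),
                 B = c(100, 150, 200, 225, 220),
                 C = c(4, 6, 3, 9, 10),
                 D = c(1200, 2300, 1230, 1100, 1000))

formula_1 <- function(x) (x - min(x)) / (max(x) - min(x))
formula_2 <- function(x) (min(x) - x) / (max(x) - min(x))

# Hypothetical lookup: which formula belongs to which column
rules <- list(A = formula_2, B = formula_1, C = formula_2, D = formula_1)

# Apply each function to its own column and store the results under the same names
BI_score <- df
BI_score[names(rules)] <- Map(function(f, col) f(col), rules, df[names(rules)])
BI_score
```

Adding a column then only means adding one entry to rules, rather than editing two separate column vectors.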

R Populate a vector by matching names to df column values

I have a named vector filled with zeros
toy1<- rep(0, length(37:45))
names(toy1) <- 37:45
I want to populate the vector with count data from a dataframe
size count
37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024
I need help finding a way to match the value for size to the vector name and then input the corresponding count value into that vector position
Might be as simple as:
toy1[ as.character(dat$size) ] <- dat$count
toy1
# 37 38 39 40 41 42 43 44 45
#1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024
R's indexing for assignments can have character values. If you had just tried to index with the raw column:
toy1[ dat$size ] <- dat$count
You would have gotten (as did I initially):
> toy1
37 38 39 40 41 42 43 44 45
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1.181 0.421
0.054 0.005 0.031 0.582 NA NA 0.024
That happened because R used numeric (positional) indexing and silently extended the vector so that positions up to 45 exist.
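The difference between indexing by name and by position can be seen on a tiny vector. This is an illustrative sketch with made-up values (v, by_name and by_pos are not from the question).

```r
# A named vector whose names look like numbers
v <- setNames(rep(0, 3), 3:5)

# Character index: matches the element *named* "4"
by_name <- v
by_name[as.character(4)] <- 9

# Numeric index: refers to *position* 4, which does not exist yet,
# so the vector is extended by one unnamed element
by_pos <- v
by_pos[4] <- 9

by_name
by_pos
```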
With a version of the dataframe that had a number that was not in the range 37:45, I did get a warning from using match with a nomatch of 0, but I also got the expected results:
toy1[ match( as.character( dat$size), names(toy1) , nomatch=0) ] <- dat$count
#------------
Warning message:
In toy1[match(as.character(dat$size), names(toy1), nomatch = 0)] <- dat$count :
number of items to replace is not a multiple of replacement length
> toy1
37 38 39 40 41 42 43 44 45
1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.000
The match function is at the core of the merge function, but this application would be much faster than a merge of data frames.
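One caveat with nomatch = 0: zero indices are dropped from the left-hand side, so if the unmatched size is not the last row, the remaining values are no longer aligned with their positions. A sketch that filters both sides by the same mask avoids this; the data frame dat_demo, with an out-of-range size of 50, is made up for illustration.

```r
toy1 <- setNames(rep(0, length(37:45)), 37:45)

# Hypothetical data with a size (50) that has no slot in toy1
dat_demo <- data.frame(size = c(37, 50, 45), count = c(1.181, 9.999, 0.024))

# Position of each size among the names of toy1 (NA when absent)
idx <- match(as.character(dat_demo$size), names(toy1))
keep <- !is.na(idx)

# Use the same mask on both sides so values stay aligned with positions
toy1[idx[keep]] <- dat_demo$count[keep]
toy1
```

No warning is raised here, because the index and the replacement always have the same length.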
Let's say your data frame is df; then you can simply update the entries in toy1 for the sizes present in your data frame:
toy1[as.character(df$size)] <- df$count
Edit: To check for a match before updating the records, compute m, which holds for each name in toy1 the index of the matching entry in the size column of df (NA where there is no match):
m <- match(names(toy1), as.character(df$size))
Then, for the indices in toy1 which have a match, it can be updated as below:
toy1[which(!is.na(m))] <- df$count[m[!is.na(m)]]
PS: An efficient alternative would be to define toy1 as a data frame and perform an outer join on the size column.
First, let's get the data loaded in.
toy1<- rep(0, length(37:45))
names(toy1) <- 37:45
df = read.table(text="37 1.181
38 0.421
39 0.054
40 0.005
41 0.031
42 0.582
45 0.024")
names(df) = c("size","count")
Now, I present a really ugly solution. We only update toy1 where the name of toy1 appears in df$size. We get the values from df$count by finding the index of each match in df; I use sapply to get a vector of those indices back. On both sides we only look at places where names(toy1) appear in df$size.
toy1[names(toy1) %in% df$size] = df$count[sapply(names(toy1)[names(toy1) %in% df$size],function(x){which(x == df$size)})]
But, this isn't very elegant. Instead, you could turn toy1 into a data.frame.
toydf = data.frame(toy1 = toy1,name = names(toy1),stringsAsFactors = FALSE)
Now, we can use merge to get the values.
updated = merge(toydf,df,by.x = "name",by.y="size",all.x=TRUE)
This returns a 3 column data.frame. You can then extract the count column from this, replace NA with 0 and you're done.
updated$count[is.na(updated$count)] = 0
updated$count
#> [1] 1.181 0.421 0.054 0.005 0.031 0.582 0.000 0.000 0.024

How to select columns based on criteria in a certain row in R

I have a matrix of values with both row names and column names, as shown here.
C5.Outliers
Days J1 J2 J3 J4
0.01 458 -160 -151 -52
0.02 459 -163 -154 -46
0.03 457 -165 -150 -51
Perc 0.99 0.04 0.00 0.52
I would like to create a separate matrix using only the columns for which the value in the row "Perc" is <= 50.0. In this example, I would be extracting columns J2 and J3.
This is the code I tried which isn't working (the "Perc" row is row #1414 on my matrix):
C5.Final<-subset(C5.Outliers, 1414<.51)
I assume you mean 0.50, since all the values in the "Perc" row are below 50.0.
This might not be the best way, but it works:
#data:
df <- data.frame(Days=c(0.01,0.02,0.03,"Perc"),J1=c(458,459,457,0.99),
J2 =c(-165,-163,-160,0.04),J3=c(-151,-153,-131,0.00),J4=c(-52,-45,-51,0.52))
dfc <- subset(df,,select= which(c(TRUE,(df[which(df$Days == "Perc"), ] <= 0.50)[2:5])))
dfc
Days J2 J3
1 0.01 -165.00 -151
2 0.02 -163.00 -153
3 0.03 -160.00 -131
4 Perc 0.04 0
You can remove the TRUE if you don't want the df$Days variable, change the 0.50 threshold if needed, expand the 2:5 if you have extra columns, or even substitute the "Perc" lookup with row 1414 if you wish.
Hope this works.
Presumably you meant <= 0.50 and not <= 50 since all "Perc" are less than 50. You can do
df[, unlist(df["Perc",]) <= 0.5]
# J2 J3
# 0.01 -160.00 -151
# 0.02 -163.00 -154
# 0.03 -165.00 -150
# Perc 0.04 0
But this may be safer and takes into account any NA values that may appear in "Perc".
u <- unlist(df["Perc",]) <= 0.50
df[, u & !is.na(u)]
Also, you can speed it up if need be by adding use.names = FALSE in unlist(). And finally, if you have a matrix and not a data frame, then you can remove unlist() altogether.
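For the matrix case mentioned above, a minimal self-contained sketch could look like this. The matrix is rebuilt from the values shown in the question, with "Days" used as row names; the object name keep is illustrative.

```r
# Matrix rebuilt from the question's example (filled column by column)
C5.Outliers <- matrix(c(458, 459, 457, 0.99,
                        -160, -163, -165, 0.04,
                        -151, -154, -150, 0.00,
                        -52, -46, -51, 0.52),
                      nrow = 4,
                      dimnames = list(c("0.01", "0.02", "0.03", "Perc"),
                                      c("J1", "J2", "J3", "J4")))

# Columns whose "Perc" value is present and at most 0.5
keep <- !is.na(C5.Outliers["Perc", ]) & C5.Outliers["Perc", ] <= 0.5

# drop = FALSE keeps the result a matrix even if only one column qualifies
C5.Final <- C5.Outliers[, keep, drop = FALSE]
C5.Final
```

Selecting the row by name ("Perc") rather than by number (1414) keeps the code working if rows are added or removed.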

R Programming issue intervals

I'm trying to figure out a formula to be able to divide the max and min number inside the intervals.
x <- sample(10:40, 100, replace = TRUE)
factorx<- factor(cut(x, breaks=nclass.Sturges(x)))
xout<-as.data.frame(table(factorx))
xout<- transform(xout, cumFreq = cumsum(Freq), relative = prop.table(Freq))
Using the above code in the R editor program, I get the following:
xout
factorx Freq cumFreq relative
1 (9.97,13.8] 14 14 0.14
2 (13.8,17.5] 13 27 0.13
3 (17.5,21.2] 16 43 0.16
4 (21.2,25] 5 48 0.05
5 (25,28.8] 11 59 0.11
6 (28.8,32.5] 8 67 0.08
7 (32.5,36.2] 16 83 0.16
8 (36.2,40] 17 100 0.17
What I want to know is whether there is a way to calculate the midpoint of each interval. For example, for the first interval it would be:
(13.8 + 9.97)/2
It's called the class midpoint in statistics, I believe.
Here's a one-liner that is probably close to what you want:
> sapply(strsplit(levels(xout$factorx), ","), function(x) sum(as.numeric(gsub("[[:space:]]", "", chartr(old = "(]", new = "  ", x))))/2)
[1] 11.885 15.650 19.350 23.100 26.900 30.650 34.350 38.100
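One possible solution is to split each label on the characters "(", "," and "]" (xout is your data frame):
x1 <- strsplit(as.character(xout$factorx), ",|\\(|]")
x2 <- do.call(rbind, x1)
# After the split, column 1 is empty, column 2 is the lower bound, column 3 the upper bound
xout$higher <- as.numeric(x2[, 3])
xout$lower <- as.numeric(x2[, 2])
xout$aver <- rowMeans(xout[, c("lower", "higher")])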
> head(xout,3)
factorx Freq cumFreq relative higher lower aver
1 (9.97,13.7] 15 15 0.15 13.7 9.97 11.835
2 (13.7,17.5] 14 29 0.14 17.5 13.70 15.600
3 (17.5,21.2] 12 41 0.12 21.2 17.50 19.350
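An alternative to parsing the labels is to build the breaks yourself and compute the midpoints from them, which avoids the rounding in the printed labels. This is a sketch under an assumption: it uses equally spaced breaks from min(x) to max(x) with include.lowest = TRUE, which is not identical to what cut(x, breaks = k) does internally (that call widens the range slightly). The names brks, mid and xout2 are made up.

```r
set.seed(1)  # only to make the sketch reproducible
x <- sample(10:40, 100, replace = TRUE)

# Number of classes by Sturges' rule, then explicit, equally spaced breaks
k <- nclass.Sturges(x)
brks <- seq(min(x), max(x), length.out = k + 1)

# include.lowest = TRUE so the minimum value falls into the first interval
f <- cut(x, breaks = brks, include.lowest = TRUE)

# Midpoint of each class: average of its lower and upper break
mid <- (brks[-length(brks)] + brks[-1]) / 2

xout2 <- data.frame(interval = levels(f),
                    Freq = as.vector(table(f)),
                    midpoint = mid)
xout2
```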
