I have a data frame like this:
GN SN
a 0.1
b 0.2
c 0.3
d 0.4
e 0.4
f 0.5
I would like the following output:
GN
a
0.1
b
0.2
c
0.3
Can anyone help me? How can I "interleave" the elements of the second column with the elements of the first column to get the desired output?
First let's create some data:
dd = data.frame(x = 1:10, y = LETTERS[1:10])
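Next, we need to make sure the y column is a character and not a factor, so that the values themselves rather than the integer factor codes are carried through. Since R 4.0.0, data.frame() no longer creates factors by default, so this step mainly matters on older versions.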
dd$y = as.character(dd$y)
Then we transpose the data frame and convert to a vector:
as.vector(t(dd))
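One thing to watch with the transpose approach: t() goes through as.matrix(), which, as far as I know, formats numeric columns to a common width when other columns are non-numeric, so 1 can come out as " 1". A minimal alternative sketch, assuming you want unpadded strings, is to convert each column to character yourself and interleave with rbind(). The name interleaved is only illustrative.

```r
# Same example data as above
dd <- data.frame(x = 1:10, y = LETTERS[1:10], stringsAsFactors = FALSE)

# rbind() stacks the two columns as rows of a 2 x n character matrix;
# c() then flattens it column by column, which interleaves the values
interleaved <- c(rbind(as.character(dd$x), dd$y))

head(interleaved, 6)
# [1] "1" "A" "2" "B" "3" "C"
```

For the data in the question, the same pattern would be c(rbind(df$GN, as.character(df$SN))), where df stands for the question's data frame.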
However, a more pertinent question is why you would want to do this.
I'm working through the options for formatting all numeric values in a data frame, and not just selected columns. I start with the following base data frame called "c" when running the code beneath it:
> c
A B
1 3.412324 2.234200
2 3.245236 4.234234
Related code:
a <- c(3.412324463,3.2452364)
b <- c(2.2342,4.234234)
c <- data.frame(A=a, B=b, stringsAsFactors = FALSE)
Next, I round all the numbers in the above "c" data frame to 2 decimal places, resulting in data frame "d" shown below with the related code immediately underneath:
> d
A B
[1,] 3.41 2.23
[2,] 3.25 4.23
Related code:
d <- as.data.frame(lapply(c, formatC, decimal.mark =".", format = "f", digits = 2))
d <- sapply(d[,],as.numeric)
Last step, I'd like to express the above data frame "d" in percentages, in a new data frame called "e". I get the below results as a list, using the code shown beneath it.
> e
X.341.0.. X.325.0.. X.223.0.. X.423.0..
1 341.0% 325.0% 223.0% 423.0%
Related code:
e <- as.data.frame(lapply(d*100, sprintf, fmt = "%.1f%%"))
How do I modify the code, in an efficient manner, to leave the data frame structure intact when deriving data frame "e", the way it is when generating data frame "d"? It would be most helpful to see a solution in both base R and dplyr.
I'm pretty sure the issue lies in my use of lapply() in creating data frame "e" (yes, by now I know lapply() spits out lists), but it worked fine in maintaining the data frame structure in creating data frame "d"!
All values in the data frame are formatted the same, so there's no need to subset the columns etc.
The thing with lapply is that it returns a list. A data.frame is a special case of a list, but to make lapply modify a data frame of a particular structure while maintaining that structure, the easiest way to do it is df[] = lapply(df, ...). The extra [] preserves the existing structure.
## d better base version
## `round` is friendlier than `formatC` - it returns numerics
d = c
d[] = lapply(d, round, 2)
d
# A B
# 1 3.41 2.23
# 2 3.25 4.23
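## dplyr version
## (passing extra arguments through across() is deprecated since dplyr 1.1.0,
## so the rounding is wrapped in a lambda)
library(dplyr)
d_dplyr = c %>%
mutate(across(everything(), ~ round(.x, 2)))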
d_dplyr
# A B
# 1 3.41 2.23
# 2 3.25 4.23
## e base
e = d
e[] = lapply(e * 100, sprintf, fmt = "%.1f%%")
e
# A B
# 1 341.0% 223.0%
# 2 325.0% 423.0%
## e dplyr
## in `tidyverse`, we'll use `scales::percent_format`
## which generates a percent conversion function according to specification
## the default already multiplies by 100 and adds the `%` sign
## we just need to specify the accuracy.
## (note that we can easily start from `c` again)
e_dplyr = c %>%
mutate(across(everything(), scales::percent_format(accuracy = 0.1)))
e_dplyr
# A B
# 1 341.2% 223.4%
# 2 324.5% 423.4%
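As a possible explanation of why the original attempt flattened everything into one row: sapply() returns a matrix here, and lapply() over a matrix visits each cell rather than each column. The sketch below reproduces that with the same numbers. The names c_df, d_mat, d_df and e_df are made up for this illustration so that the data frame does not shadow the function c().

```r
c_df <- data.frame(A = c(3.412324463, 3.2452364), B = c(2.2342, 4.234234))

# sapply() simplifies its result to a numeric matrix, not a data frame
d_mat <- sapply(c_df, round, 2)
is.matrix(d_mat)
# [1] TRUE

# lapply() on a matrix iterates over the 4 cells, hence the 4 odd columns
length(lapply(d_mat * 100, sprintf, fmt = "%.1f%%"))
# [1] 4

# Assigning into d_df[] keeps the 2 x 2 data frame shape at every step
d_df <- c_df
d_df[] <- lapply(c_df, round, 2)
e_df <- d_df
e_df[] <- lapply(d_df * 100, sprintf, fmt = "%.1f%%")
e_df
```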
I have the following task:
a = data.frame(a= c(1,2,3,4,5,6)) # dataset
range01 <- function(x){(x-min(a$a))/(max(a$a)-min(a$a))} # rule for scale
b = data.frame(a = 6) # new dataset
lapply(b$a, range01) # we can apply range01 for this dataset because we use min(a$a) in the rule
But how can I apply this when I have many columns in my dataset, like below?
a = data.frame(a= c(1,2,3,4,5,6))
b = data.frame(b= c(1,2,3,3,2,1))
c = data.frame(c= c(6,2,4,4,5,6))
df = cbind(a,b,c)
df
new = data.frame(a = 1, b = 2, c = 3)
Of course, I can write a rule for every variable:
range01a <- function(x){(x-min(df$a))/(max(df$a)-min(df$a))}
But that is a very long-winded way. How can I make it more convenient?
You can redefine your scale function so that it takes two arguments, one for the values to be scaled and one for the scaler, and then use Map on the two data frames:
scale_custom <- function(x, scaler) (x - min(scaler)) / (max(scaler) - min(scaler))
Map(scale_custom, new, df)
#$a
#[1] 0
#$b
#[1] 0.5
#$c
#[1] 0.25
If you need the data frame as result:
as.data.frame(Map(scale_custom, new, df))
# a b c
#1 0 0.5 0.25
You can exploit the fact that the column names of new and df are the same. This can be helpful if the order of the columns in the two data frames is not the same.
sapply(names(new), function(x) (new[x]-min(df[x]))/(max(df[x])-min(df[x])))
#$a.a
#[1] 0
#$b.b
#[1] 0.5
#$c.c
#[1] 0.25
To put the result in a data.frame:
data.frame(lapply(names(new), function(x) (new[x]-min(df[x]))/(max(df[x])-min(df[x]))))
# a b c
#1 0 0.5 0.25
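The two ideas above can be combined. The following is a minimal sketch, assuming new may have several rows and its columns in a different order than df. The two-row new used here and the name scaled are made up for illustration. Indexing df by names(new) lines the scalers up by name before Map pairs them.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                 b = c(1, 2, 3, 3, 2, 1),
                 c = c(6, 2, 4, 4, 5, 6))

# Hypothetical new data: two rows, columns deliberately out of order
new <- data.frame(c = c(3, 6), a = c(1, 6), b = c(2, 3))

scale_custom <- function(x, scaler) (x - min(scaler)) / (max(scaler) - min(scaler))

# df[names(new)] reorders the reference columns to match new;
# assigning into scaled[] keeps the data frame structure
scaled <- new
scaled[] <- Map(scale_custom, new, df[names(new)])
scaled
#      c a   b
# 1 0.25 0 0.5
# 2 1.00 1 1.0
```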
I am trying to vectorize the following task with one of the apply functions, but in vain.
I have a list and a dataframe. What I am trying to accomplish is to create subgroups in a dataframe using a lookup list.
The lookup list (which is basically a set of percentile groups) looks like the following:
Look_Up_List
$`1`
A B C D E
0.000 0.370 0.544 0.698 9.655
$`2`
A B C D E
0.000 0.506 0.649 0.774 1.192
The current dataframe looks like this:
Score Big_group
0.1 1
0.4 1
0.3 2
Resulting dataframe must look like the following with an additional column. It matches the score in the percentile bucket from the lookup list in the corresponding Big_Group:
Score Big_group Sub_Group
0.1 1 A
0.4 1 B
0.3 2 A
Thanks so much
You can create a function like this:
myFun <- function(x) {
names(Look_Up_List[[as.character(x[2])]])[
findInterval(x[1], Look_Up_List[[as.character(x[2])]])]
}
And apply it by row with apply:
apply(mydf, 1, myFun)
# [1] "A" "B" "A"
# reproducible input data
# (list elements are named with `=`; `<-` inside list() would leave them unnamed)
Look_Up_List <- list('1' = c(A=0.000, B=0.370, C=0.544, D=0.698, E=9.655),
'2' = c(A=0.000, B=0.506, C=0.649, D=0.774, E=1.192))
Current <- data.frame(Score=c(0.1, 0.4, 0.3),
Big_group=c(1,1,2))
# Solution 1 (use the lookup vector that matches each row's Big_group)
Current$Sub_Group <- sapply(seq_len(nrow(Current)), function(i) {
lookup <- Look_Up_List[[as.character(Current$Big_group[i])]]
max(names(lookup)[Current$Score[i] > lookup])
})
# Alternative solution (using findInterval, slightly slower at least for this dataset)
Current$Sub_Group <- sapply(seq_len(nrow(Current)), function(i) {
lookup <- Look_Up_List[[as.character(Current$Big_group[i])]]
names(lookup)[findInterval(Current$Score[i], lookup)]
})
# show result
Current
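Since the question asks for a vectorized approach, here is a rough sketch that loops only over the distinct Big_group values and lets findInterval() handle all scores of a group in one call. It assumes every Big_group value has an entry in Look_Up_List. Scores below the first break are mapped to NA, which is a choice made for this sketch rather than something stated in the question.

```r
Look_Up_List <- list(`1` = c(A = 0.000, B = 0.370, C = 0.544, D = 0.698, E = 9.655),
                     `2` = c(A = 0.000, B = 0.506, C = 0.649, D = 0.774, E = 1.192))
Current <- data.frame(Score = c(0.1, 0.4, 0.3), Big_group = c(1, 1, 2))

groups <- as.character(Current$Big_group)
Current$Sub_Group <- NA_character_

for (g in unique(groups)) {
  rows <- groups == g
  breaks <- Look_Up_List[[g]]
  # findInterval() is vectorized over all scores in this group
  idx <- findInterval(Current$Score[rows], breaks)
  idx[idx == 0] <- NA  # score below the first break: no bucket
  Current$Sub_Group[rows] <- names(breaks)[idx]
}

Current
#   Score Big_group Sub_Group
# 1   0.1         1         A
# 2   0.4         1         B
# 3   0.3         2         A
```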
Example Dataset:
A 2 1.5
A 2 1.5
B 3 2.0
B 3 2.5
B 3 2.6
C 4 3.2
C 4 3.5
So here I would like to create three frequency histograms based on the first two columns, i.e. one each for A2, B3, and C4. I am new to R, so any help would be greatly appreciated. Should I flatten out the data so it looks like this:
A 2 1.5 1.5
B 3 2.0 2.5 2.6 etc...
Thank you
Here's an alternative solution, that is based on by-function, which is just a wrapper for the tapply that Jilber suggested. You might find the 'ex'-variable useful:
set.seed(1)
dat <- data.frame(First = LETTERS[1:3], Second = 1:2, Num = rnorm(60))
# Extract third column per each unique combination of columns 'First' and 'Second'
ex <- by(dat, INDICES =
# Create names like A.1, A.2, ...
apply(dat[,c("First","Second")], MARGIN=1, FUN=function(z) paste(z, collapse=".")),
# Extract third column per each unique combination
FUN=function(x) x[,3])
# Draw histograms
par(mfrow=c(3,2))
for(i in 1:length(ex)){
hist(ex[[i]], main=names(ex)[i], xlim=extendrange(unlist(ex)))
}
Assuming your dataset is called x and the columns are a,b,c respectively I think this command should do the trick
library(lattice)
histogram(~c|a+b,x)
Note that this requires the lattice package to be installed.
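If you would rather stay in base R, one possible sketch is to split the third column by the combination of the first two and draw one histogram per group. The column names a, b, c follow the assumption made above. The data frame is the example dataset from the question, and groups is just an illustrative name.

```r
# Example dataset from the question, with assumed column names a, b, c
x <- data.frame(a = c("A", "A", "B", "B", "B", "C", "C"),
                b = c(2, 2, 3, 3, 3, 4, 4),
                c = c(1.5, 1.5, 2.0, 2.5, 2.6, 3.2, 3.5))

# interaction() builds one label per (a, b) pair;
# drop = TRUE discards combinations that never occur
groups <- split(x$c, interaction(x$a, x$b, drop = TRUE))
names(groups)
# [1] "A.2" "B.3" "C.4"

# One histogram per group, side by side
par(mfrow = c(1, length(groups)))
for (nm in names(groups)) {
  hist(groups[[nm]], main = nm, xlab = "c")
}
```

This also answers the "flatten" question in a sense: there is no need to reshape the data by hand, because split() produces the per-group vectors directly.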
I have a dataframe a, where A, B, C, ... are separate entries:
Source Target N
A B 100
A D 200
I have another dataframe b for entries' attributes
Name Rate1 Rate2
A 0.1 0.2
B 0.2 0.3
I want to calculate a new column Flow in a, computed row by row as Flow = a$N * b[Name == a$Source]$Rate1. I tried using apply by row, but it felt slow. Is there a faster way?
I don't know what you have tried with apply, but here is an answer with merge and transform:
transform(merge(a,b,by.x = 'Source',by.y ='Name'),flow = N*Rate1)
Source Target N Rate1 Rate2 flow
1 A B 100 0.1 0.2 10
2 A D 200 0.1 0.2 20
Here's a fairly expressive solution, fairly similar to the code you tried:
> a$Flow <- a$N*b$Rate1[ match(a$Source, b$Name) ]
> a
Source Target N Flow
1 A B 100 10
2 A D 200 20
The match function is the basis for merge and %in%. It is particularly useful for constructing index vectors to pick from alternatives.
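To illustrate how match behaves as an index builder, here is a small sketch. The third row of a, whose Source "C" has no entry in b, is made up for this example to show the unmatched case: match returns NA there, so Flow becomes NA and the row order of a is left untouched.

```r
a <- data.frame(Source = c("A", "A", "C"),   # "C" is a hypothetical unmatched source
                Target = c("B", "D", "A"),
                N      = c(100, 200, 300))
b <- data.frame(Name  = c("A", "B"),
                Rate1 = c(0.1, 0.2),
                Rate2 = c(0.2, 0.3))

# Position of each Source in b$Name; NA where there is no match
idx <- match(a$Source, b$Name)
idx
# [1]  1  1 NA

# Indexing with NA yields NA, so unmatched rows get an NA Flow
a$Flow <- a$N * b$Rate1[idx]
a
```

By contrast, merge with its default settings keeps only the rows that have a match in both data frames, so which one fits better depends on whether unmatched sources should be kept.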