I created a "trust in politics"-index by aggregating 5 variables, all of which measured some form of trust on a scale from 1-10.
attach(ess_variablen)
aggr_trst <- (1/5)*(trst_prl+trst_leg+trst_part+trst_politic+trst_polit)
However, the results contain one decimal place, whereas I would like to round the numbers so that the index values have no decimals. I have not been able to find a way to round numerical values created by an index like this. Does anyone know how to achieve that? Thank you!
The round() function can be used to round to the nearest whole number when one uses 0 as the number of decimal places. The following example illustrates this by summing rows in a data frame of uniform random numbers.
set.seed(950141237)
data <- as.data.frame(matrix(runif(20),nrow=5,ncol=4))
data
data$index <- round(rowSums(data),0) # round to 0 decimal places to obtain whole numbers
data
...and the output.
> set.seed(950141237)
> data <- as.data.frame(matrix(runif(20),nrow=5,ncol=4))
> data
V1 V2 V3 V4
1 0.07515484 0.8874008 0.37130877 0.05977506
2 0.30097513 0.8178419 0.05203982 0.10694951
3 0.82328607 0.4182799 0.24034152 0.52173278
4 0.52849399 0.8690592 0.66814229 0.66475498
5 0.01914658 0.8322007 0.41399458 0.19649338
> data$index <- round(rowSums(data),0) # round to 0 decimal places to obtain whole numbers
> data
V1 V2 V3 V4 index
1 0.07515484 0.8874008 0.37130877 0.05977506 1
2 0.30097513 0.8178419 0.05203982 0.10694951 1
3 0.82328607 0.4182799 0.24034152 0.52173278 2
4 0.52849399 0.8690592 0.66814229 0.66475498 3
5 0.01914658 0.8322007 0.41399458 0.19649338 1
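Applied to the index from the question (a sketch reusing the variable names given there), the same approach would be:
aggr_trst <- round((trst_prl + trst_leg + trst_part + trst_politic + trst_polit) / 5, 0)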
Edit: it seems my previous solution did not work; sorry for the inconvenience.
You can try the ceiling() and floor() functions, depending on your needs:
notRound <- (1/5) * (1.2 + 2.3 + 3.3)
print(notRound) # 1.36
roundUp <- ceiling(notRound)
print(roundUp) # 2
roundDown <- floor(notRound)
print(roundDown) # 1
Live snippet if you want to try it: http://rextester.com/WURVZ7294
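One caveat when choosing between these functions: round() follows the IEC 60559 standard and rounds halves to the nearest even digit, while ceiling() and floor() always move in one fixed direction:
round(0.5) # 0 (half rounds to the even digit)
round(1.5) # 2
ceiling(0.5) # 1 (always up)
floor(1.5) # 1 (always down)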
Suppose I have a vector V1 (with two or more elements):
V1 <- 1:10
I can reorder the original vector with the function sample(). This function, however, cannot guarantee that no element of the new vector ends up in the same position as in the original vector. For example:
set.seed(4)
V2 <- sample(V1)
This will result in a vector that has two elements in the same position as the original one:
V1[V1 == V2]
3 5
My question is: is it possible to generate a random permutation that is guaranteed to have no element in the same position as in the original vector?
Your requirement that every element change position means that you don't want a purely random permutation, where fixed points can occur. The best I could come up with is to just loop, calling sample() until we find a vector where every element shifts:
v1 <- 1:10
v1_perm <- v1
cnt <- 0
while (sum(v1 == v1_perm) > 0) {
v1_perm <- sample(v1)
cnt <- cnt + 1
}
v1
v1_perm
paste0("It took ", cnt, " tries to find a suitable vector")
[1] 1 2 3 4 5 6 7 8 9 10
[1] 3 10 4 7 8 1 6 2 5 9
[1] "It took 3 tries to find a suitable vector"
Note that I have implemented the requirement of shifting positions by checking values. This isn't strictly equivalent, because two values could be equal. But assuming all your entries are unique, zero overlap of values equates to zero overlap of indices.
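If you need this more than once, the rejection loop can be wrapped in a small helper function (a minimal sketch; derange is my own name for it). A permutation with no fixed points is known as a derangement, and the proportion of derangements among all permutations approaches 1/e, so the loop needs fewer than three tries on average:
derange <- function(v) {
  repeat {
    p <- sample(v)
    if (!any(p == v)) return(p) # accept only permutations with no fixed point
  }
}
set.seed(4)
derange(1:10)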
If I have an ASCII text file that reads like this:
12345
and I want to separate it into single integers so that it becomes
v1 v2 v3 v4 v5
1 2 3 4 5
In other words, each integer is a variable.
I know I can use read.fwf() in R, but since I have nearly 500 variables in my dataset, is there a better way to divide the integers into their own columns than putting widths=c(1,) and repeating the "1," 500 times?
I also tried importing the ASCII file into Excel and SPSS, but neither allows me to insert variable breaks at fixed intervals.
You could determine the width of the file by reading in one row as-is, then use that for read_fwf(). Using tidyverse functions,
library(readr)
library(stringr)
path <- "path_to_data.txt" # your path
# one pass of the data: read the first line as plain text
# (read_lines() avoids numeric parsing, which could alter the digits)
pass <- read_lines(path, n_max = 1) # just the first line
filewidth <- str_length(pass) # width of first row
# use fwf with one-character columns
df <- read_fwf(path, fwf_widths(rep(1, filewidth)))
Here's an option using read.fwf(), which was your initial choice.
# for the example only, a two line source with different line lengths
input <- textConnection("12345\n6789")
df1 <- read.fwf(input, widths = rep(1, 500))
ncol(df1)
# [1] 500
But assume you actually have fewer than 500 (as you say, and as is the case in this example); then the extra columns, where every value is NA, can be removed as follows. This uses your longest line to determine the number of columns that are retained.
df1 <- df1[, apply(!is.na(df1), 2, any)] # keep columns with at least one non-NA value
df1
# V1 V2 V3 V4 V5
# 1 1 2 3 4 5
# 2 6 7 8 9 NA
However, if no missing values are acceptable, then use all() so that your shortest line determines the number of columns that are retained.
df1 <- df1[, apply(!is.na(df1), 2, all)] # keep only columns with no NA values
df1
# V1 V2 V3 V4
# 1 1 2 3 4
# 2 6 7 8 9
Of course, if you know the exact line length and all lines are the same length, then just set widths = rep(1, x) with x set to the known length.
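If the length is not known in advance, a minimal base-R sketch (assuming all lines have the same length, and reusing the placeholder path from above) can infer it from the first line:
x <- nchar(readLines("path_to_data.txt", n = 1)) # width of the first line
df1 <- read.fwf("path_to_data.txt", widths = rep(1, x))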
If you are using Excel 2010 or later, you can import the file using Power Query (aka Get & Transform). When you edit the input, there is an option to split columns and specify the number of characters.
This tool is included in Excel 2016, and is a free Microsoft add-in for Excel 2010 and later.
I'm working with an expression matrix obtained from single-cell RNA sequencing, but I have a question about some R code a colleague sent me...
sort(unique(1 + slot(as(data_matrix, "dgTMatrix"), "i")))
# there are no more details in the code...
In theory, this line is meant to remove genes that are not expressed (i.e. zero in all samples, I think...), but I can't make sense of it. Can anyone give me a tip?
Well, I think I have understood this code... let me try to explain it! (Please correct me if I'm wrong.)
Our data is stored as a sparse matrix (i.e. handier in terms of memory, link), and with as() it is coerced to a specific format for this kind of matrix (Triplet Format for Sparse Matrices, link): three slots holding the row index i, the column index j, and the value x of each non-zero entry.
y <- matrix_counts # sparse matrix
AAACCTGAGAACAACT-1 AAACCTGTCGGAAATA-1 AAACGGGAGAGCTGCA-1
ENSG00000243485 1 . .
ENSG00000237613 . . 2
y2 <- as(y, "dgTMatrix") #triplet format for sparse matrix
i j x
1 9 1 1 # in row 9, column 1 we have the value 1
2 50 1 2
3 60 1 1
4 62 1 2
5 78 1 1
6 87 1 1
Then it takes only the slot "i" (slot(data, "i")), because we only need the row indices (to know which rows are different from zero), and removes duplicates (unique) to finally obtain a vector of row indices, which will be used to filter the raw data:
y3 <- unique(1 + slot(as(exprs(gbm), "dgTMatrix"), "i"))
[1] 9 50 60 62 78 87
data <- data_raw[y3,]
I was a bit confused by sort and the 1 + at first, but the i slot of a dgTMatrix uses 0-based indexing (the C convention), so 1 + converts it to R's 1-based row numbers, and sort() simply puts the indices in ascending order. So, to summarize: we take the row indices of the non-zero rows (genes) and use them to filter our raw data... another original method for deleting non-expressed genes, interesting!
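For comparison, here is a minimal sketch of the same filter written more directly (assuming data_matrix is the sparse count matrix from the question; Matrix::rowSums() works on sparse matrices without converting them to dense form):
library(Matrix)
keep <- Matrix::rowSums(data_matrix != 0) > 0 # TRUE for genes with at least one non-zero count
data_filtered <- data_matrix[keep, ]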
New to R and would like to do the following operation:
I have a set of numbers e.g. (1,1,0,1,1,1,0,0,1) and need to count adjacent duplicates as they occur. The result I am looking for is:
2,1,3,2,1
as in 2 ones, 1 zero, 3 ones, etc.
Thanks.
We can use rle() and extract its lengths component:
rle(v1)$lengths
#[1] 2 1 3 2 1
data
v1 <- c(1,1,0,1,1,1,0,0,1)
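If you also want to know which value each run consists of, the values component of the same rle() result pairs up with the lengths:
with(rle(v1), data.frame(value = values, count = lengths))
#   value count
# 1     1     2
# 2     0     1
# 3     1     3
# 4     0     2
# 5     1     1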
I have searched long and hard for a solution using apply but have been unable to find exactly what I need. I'm a new R user coming over from Excel, and I need to calculate the percent difference between each observation and a control. A realistic sample data frame looks like this:
site <- c(rep(1, 10), rep(2,10), rep(3,10))
element <- rep(c("ca", "Mg", "K"), 10)
control <- seq(from= 1,to=60, by=2)
BA01 <- seq(from= 31,to=90, by=2)
BA02 <- seq(from= 21,to=80, by=2)
BA03 <- seq(from= 101,to=160, by=2)
mydf <- data.frame(site, element, control, BA01, BA02,BA03)
where BA01 to BA03 are different tests that will be compared to the control.
All I would like to do is apply a formula like this:
((BA01-control)/control)*100
and have it calculated for every test column (BA01 to BA03) and every row in the data frame. In Excel I could just copy and paste the site and element columns plus the headers BA01 to BA03, then type the formula in cell C2 and drag it to the right as far as needed, then down as far as needed, and have my results. In R I'm having difficulty getting the same results. I've already tried apply but cannot get it to work. Basically, I would like to have site and element as columns 1 and 2, followed by the results of the formula, with BA01, BA02, and BA03 as the column names. It probably wouldn't make a difference, but my real data frame will have upwards of 130 columns and several thousand rows.
Does anyone have some tips for me?
Thank you very much in advance for your help.
Dan
If I understand correctly:
cbind(mydf[1:2], sapply(mydf[-(1:3)], function(x) 100 * (x - mydf[[3]]) / mydf[[3]]))
site element BA01 BA02 BA03
1 1 ca 3000.00000 2000.00000 10000.0000
2 1 Mg 1000.00000 666.66667 3333.3333
3 1 K 600.00000 400.00000 2000.0000
4 1 ca 428.57143 285.71429 1428.5714
5 1 Mg 333.33333 222.22222 1111.1111
...
Try this; it is the same formula rearranged, since 100*x/c - 100 = 100*(x - c)/c:
cbind(mydf[1:2], 100 * mydf[4:6] / mydf$control - 100)
The first 5 lines of output are:
site element BA01 BA02 BA03
1 1 ca 3000.00000 2000.00000 10000.0000
2 1 Mg 1000.00000 666.66667 3333.3333
3 1 K 600.00000 400.00000 2000.0000
4 1 ca 428.57143 285.71429 1428.5714
5 1 Mg 333.33333 222.22222 1111.1111
How about:
pdiff <- function(x,y) (x-y)/y*100
BAcols <- subset(mydf,select=c(BA01,BA02,BA03))
This subset() call is readable for a small data frame, but if you really have lots of columns to normalize, you will want to select them by numeric range, i.e. mydf[,-(1:3)] (drop the first three columns) or mydf[,4:ncol(mydf)] (keep columns 4 through the last).
cbind(mydf[,1:2],sweep(BAcols,1,mydf$control,pdiff))
or
with(mydf,data.frame(site,element,sweep(BAcols,1,control,pdiff)))
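For completeness, a minimal tidyverse sketch of the same calculation (assuming, as in the example, that every test column name starts with "BA"):
library(dplyr)
mydf %>%
  mutate(across(starts_with("BA"), ~ (.x - control) / control * 100)) %>%
  select(site, element, starts_with("BA"))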