GT VB WM
23 34 28
34 27 33
   44 46
   54
I have data like the above in a CSV file. I need an R script that retrieves the values column by column, either with a loop or with a function that takes the column name as an argument. For example, when I type GT I should get the relevant values without NA, like this:
GT
23 34
This
lapply(df, na.omit)
creates a list of vectors with all NAs removed.
Based on all the information you have given, the following R commands (an "R script") will do this for you. I'm assuming that the CSV file contains 3 columns called GT, VB and WM in the first row and that there are 4 rows of data starting in row 2. I'm also assuming that the file really is in comma-separated format, meaning that the columns (including those in the header row) are separated by commas.
df <- read.csv("myfile.csv")
If you don't want NA values to appear whenever you type the name of the column, you'll have to remove the NA from each element of the data frame, saving the results as a list (since a data frame cannot have columns of unequal length), using either lapply or sapply.
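# lapply always returns a list; sapply would simplify to a matrix if every column kept the same number of values
df.list <- lapply(df, FUN=function(x) x[!is.na(x)])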
And then attach to it:
attach(df.list)
Typing the names of the columns should return the original values with NA omitted.
GT
[1] 23 34
VB
[1] 34 27 44 54
WM
[1] 28 33 46
When you are finished, detach from this modified R object as it is good practice to do so.
detach(df.list) # Good practice
And this does exactly what you said. No more and no less.
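If you would rather pass the column name as an argument than attach anything, a small helper function is one option. This is only a sketch: the function name get_column is made up for illustration, and the data frame is rebuilt by hand from the values in the question rather than read from the CSV file.

```r
# Sample data rebuilt from the question (assumption: blanks are read as NA)
df <- data.frame(GT = c(23, 34, NA, NA),
                 VB = c(34, 27, 44, 54),
                 WM = c(28, 33, 46, NA))

# Hypothetical helper: return one column, by name, with the NA values dropped
get_column <- function(data, name) {
  values <- data[[name]]
  values[!is.na(values)]
}

get_column(df, "GT")
# [1] 23 34

# The same idea applied to every column at once gives a named list
all_columns <- lapply(names(df), function(name) get_column(df, name))
names(all_columns) <- names(df)
all_columns$WM
# [1] 28 33 46
```

Because the column name is an ordinary character string, it can also come from a variable or a loop, e.g. for (name in names(df)) print(get_column(df, name)).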
data
library(tibble)
df <- tribble(~GT, ~VB, ~WM,
23, 34, 28,
34, 27, 33,
NA, 44, 46,
NA, 54, NA)
Related
I have a data frame with 25 million rows and I need to run a substring function on all 25 million rows of data. Because of the size of the data frame I thought apply would be the most efficient way of doing this.
df <- data.frame( seq_start=c(75, 59, 44),
seq_end=c(151, 135, 120),
sequence=c("NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA", "NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG", "NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA"))
Function to accomplish this that I thought would be the most efficient:
apply(df,1,substr(sequence,seq_start,seq_end))
I'm not familiar with the apply function, and a loop is way too inefficient to process 25 million lines.
I'm not 100% sure what you need, but the dplyr syntax seems useful here (more useful than apply, as you're only looking to extract a substring from a single column).
library(dplyr)
df %>%
mutate(substring = substr(sequence,seq_start,seq_end))
seq_start seq_end
1 75 151
2 59 135
3 44 120
sequence
1 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
2 NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG
3 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA
substring
1 ATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
2 TAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACAC
3 AAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATAT
Base R:
df$substring <- substr(df$sequence,df$seq_start,df$seq_end)
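The reason no apply call is needed is that substr is vectorised over its start and stop arguments, so one call handles every row. A small sketch with made-up short strings (the data frame name reads and its contents are invented for illustration, not taken from the question):

```r
# Toy data: short strings stand in for the long sequences in the question
reads <- data.frame(seq_start = c(2, 1, 3),
                    seq_end   = c(4, 3, 5),
                    sequence  = c("ABCDEF", "GHIJKL", "MNOPQR"),
                    stringsAsFactors = FALSE)

# One vectorised call: element i uses seq_start[i] and seq_end[i]
reads$substring <- substr(reads$sequence, reads$seq_start, reads$seq_end)

reads$substring
# [1] "BCD" "GHI" "OPQ"
```

The original apply attempt also fails for a separate reason: the third argument of apply must be a function, whereas substr(sequence, seq_start, seq_end) is evaluated immediately and cannot find those column names.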
I'm reading a .sav file using haven:
library(haven)
data <- read_spss("file.sav", user_na = FALSE)
Then trying to display one of the variables in a table:
table(data$region)
Which returns:
1 2 3 4 5 6 7 8 9 10 11 12
85 208 43 171 30 40 95 310 133 29 77 36
Which is technically correct, however - in SPSS, the numerical values in the top row have labels associated with them (region names in this case). If I just run data$region, it shows me the numbers and their associated labels at the end of the output, but is there a way to make those string labels appear in the first table row instead of their numerical counterparts?
Thank you in advance for your help!
The way to do this is to cast the variable as a factor, using the "labels" attribute of the vector as the factor levels. The sjlabelled package includes a function that does this in one step:
data$region <- sjlabelled::as_label(data$region)
While the table command will still work on the resulting data, the layout may be a little messy. The forcats package has a function that pretty-prints frequency tables for factors:
forcats::fct_count(data$region)
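To see what this conversion does, here is a base R sketch that needs no extra packages. It assumes the imported variable is a numeric vector carrying a named "labels" attribute, which is how the labels show up when you print data$region; the vector below is built by hand with invented region names, not read from a .sav file.

```r
# Hand-built stand-in for a labelled SPSS variable (values and names are invented)
region <- structure(c(1, 2, 2, 3),
                    labels = c(North = 1, South = 2, West = 3))

# Use the label values as factor levels and the label names as the displayed labels
value_labels <- attr(region, "labels")
region_factor <- factor(region,
                        levels = value_labels,
                        labels = names(value_labels))

table(region_factor)
# region_factor
# North South  West
#     1     2     1
```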
I have a matrix with columns denoting 30 different frequency windows and rows denoting dates. I would like to extract each column and assign a variable to each resulting vector and have the name of that variable be the name of that frequency window (which I have in center values, so I'd like to name each variable something like f100). What is the best way to write a loop to both extract and name each variable?
Thanks!
If you want to create 30 variables in the global environment from the columns of the matrix, you could use list2env or assign (I would probably keep it together in a matrix/dataframe or even in a list and do all the necessary operations rather than cluttering the global environment with lots of variables).
list2env(lapply(as.data.frame(mat), function(x) x), envir=.GlobalEnv)
# <environment: R_GlobalEnv>
f1
#[1] 37 38 12 34 26 21 30 6 27 29
data
set.seed(42)
mat <- matrix(sample(1:40, 30*10, replace=TRUE), ncol=30,
dimnames=list(NULL, paste0("f", 1:30)))
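If you want the explicit loop the question asks about, or want to follow the advice to keep everything in a list, a sketch could look like the following. The small matrix and the column names f100, f200, f300 are invented for illustration, and the loop assigns into a separate environment (env, a made-up name) rather than the global one, to avoid the clutter mentioned above.

```r
# Toy matrix: 2 dates (rows) by 3 frequency windows (columns)
mat <- matrix(1:6, ncol = 3,
              dimnames = list(NULL, c("f100", "f200", "f300")))

# Option 1: keep the columns together in a named list
columns <- as.list(as.data.frame(mat))
columns[["f200"]]
# [1] 3 4

# Option 2: a loop with assign(), writing into a dedicated environment
env <- new.env()
for (nm in colnames(mat)) {
  assign(nm, mat[, nm], envir = env)
}
get("f300", envir = env)
# [1] 5 6
```

Replacing envir = env with envir = .GlobalEnv would create the variables in the global environment instead.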
I have a dataset with 100 columns and it doesn't have a header.
I have an integer vector containing some numbers between 1 and 100, for example "2 5 62 78".
Now when I read the dataset using read.table, all I want is to select columns 2, 5, 62 and 78 from the dataset. How can I do that? Many thanks.
What you want is the option colClasses of read.table() (and the derivative functions). It allows you to pass a character vector with the classes of each column in the data. If you set that to "NULL" the column will be skipped. You can set the whole thing to "NULL" and then only change the ones you want to import (based on their class).
Proof of concept below.
cc <- rep('NULL', 100) ## skip all 100 columns
cc[c(2, 5)] <- 'integer' ## 2 and 5 are integer
cc[c(62, 78)] <- 'character' ## 62 and 78 will be imported as character
df <- read.csv('really-wide-data.csv', header=FALSE, colClasses=cc) ## header=FALSE because the file has no header row
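If you do not know the classes in advance, the index vector can drive the selection directly. The sketch below is an assumption-laden toy: it writes a four-column temporary file instead of using the real 100-column dataset, and it sets the wanted entries to NA so that the reader guesses their type, as the colClasses documentation describes.

```r
# Toy file with 4 columns and no header (stands in for the 100-column dataset)
path <- tempfile(fileext = ".csv")
writeLines(c("1,a,2.5,x",
             "2,b,3.5,y"), path)

wanted <- c(2, 4)        # the integer vector of column positions to keep

cc <- rep("NULL", 4)     # skip every column by default
cc[wanted] <- NA         # NA means: read this column and guess its class

selected <- read.csv(path, header = FALSE, colClasses = cc)

names(selected)
# [1] "V2" "V4"
```

For the real data, the 4 would become 100 and wanted would be c(2, 5, 62, 78).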
For a function that I am writing, the output is a dataframe. But how do I assign the values that are in one of the columns of my dataframe to objects?
For example, if I have 2 vectors that I cbind into a dataframe
>numbers<-c(33, 44, 55, 66)
>names<-c("A", "B", "C", "D")
>MYdataframe<-data.frame(cbind(names, numbers))
I will get this:
>MYdataframe
names numbers
1 A 33
2 B 44
3 C 55
4 D 66
But how do I assign the numbers (e.g. 33) to objects (e.g. A)?
It does not look like a very good idea: your function would be assigning variables in the global environment, or in its parent environment, instead of returning something. If you want to return several values, you can put them in a named list, e.g., list(A=3.14, B=2.71), or a vector if they all have the same type (they do, if you can put them in a data.frame).
In addition, in your example, cbind converts the numbers into character strings, which data.frame then turns into factors in R versions before 4.0: I am not sure this is intentional.
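As a sketch of the "return a named list" suggestion: the function below builds the data frame without cbind, so the numbers stay numeric, and returns the values as a list named by the labels. The function name make_values and the column names label and value are invented for illustration.

```r
# Hypothetical function: returns a named list instead of assigning variables
make_values <- function() {
  labels <- c("A", "B", "C", "D")
  values <- c(33, 44, 55, 66)

  # data.frame() without cbind() keeps 'value' numeric
  result <- data.frame(label = labels, value = values,
                       stringsAsFactors = FALSE)

  # One list element per row, named after the label
  setNames(as.list(result$value), result$label)
}

out <- make_values()
out$A
# [1] 33
out[["C"]]
# [1] 55
```

The caller can then decide what to do with the values, and nothing is written into the global environment behind their back.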
However, if you really insist, this can be done with assign.
library(plyr)
d_ply( MYdataframe, "names", function(u)
assign( as.character(u$names[1]), u$numbers, envir=.GlobalEnv)
)
If you really wanted to use the character values as names and the numeric values as a named numeric vector, then this would do it:
names(numbers) <- names
numbers
# A B C D
#33 44 55 66
numbers["A"]
# A
# 33
Maybe you should say what you really want. Choosing names for your objects that are not function names (names is a function) will also help us keep things sorted out in our heads.