Given a table with rows of varying lengths and contents, what's the best way to create a dataframe for columnar analysis?
For example, given an unlabeled CSV that looks like this:
A,B,A,C
A,B,C,D,E,F
B,C,A,B,F,F,F
A,B
B,C,D
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,Y,X,Z,AA,AB,AC
The goal will be to eventually assign a value to each letter based on the position it appears in.
Given the variable and unknown length of the rows, how should I approach this problem? Set up a dataframe with an absurdly large number of columns as a placeholder?
One option is to read each row as an element of a character vector using readLines():
x <- readLines("test.csv") # add appropriate path to the file
x
[1] "A,B,A,C" "A,B,C,D,E,F"
[3] "B,C,A,B,F,F,F" "A,B"
[5] "B,C,D" "A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,Y,X,Z,AA,AB,AC"
Now you can manipulate each element of this vector as you wish and then assemble the results in your desired structure. This way you don't have to "Set up a dataframe with an absurdly large number of columns as a placeholder".
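For instance, a minimal sketch that pads the shorter rows with NA and binds everything into a data.frame (assuming x is the vector from readLines() above):
parts <- strsplit(x, ",")                      # split each line on commas
ncols <- max(lengths(parts))                   # the widest row sets the column count
padded <- lapply(parts, function(p) c(p, rep(NA, ncols - length(p))))
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
Each row now occupies one row of df, with NA filling the unused trailing columns, so positional (columnar) analysis works without guessing a column count up front.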
I have a large database which is divided into file chunks for ease of analysis/storage, and I am trying to extract, from a single character-format column, all rows matching any of multiple specific values, in order to take a chunk of the overall data for further analysis.
Within these files I am interested in pulling ALL rows in which the column "Cat" equals any of a set of character values (different for each pull and each file).
Files are set up (for example) as:
2001_x.sas
2002_x.sas
....
2018_x.sas
Currently, I am doing the following:
# Create a list of files -- fill out pattern to choose specific files with similar names
x <- list.files(pattern = "_x.sas")
# Read each file and subset it where Cat is C21, C98, or D27, etc.
z <- lapply(x, function(f) {
  a <- read.sas(f)
  subset(a, Cat == "C21" | Cat == "C98" | Cat == "D27")
})
# Bind the data.frames into a master data.frame
y <- bind_rows(z)
y is then a really nice pull from multiple files at once. Since the total dataset is several terabytes, the advantage is that this works within the individual files and doesn't overwhelm the memory on my desktop.
The problem is that I can't always write out the Cat comparisons for just three values. Sometimes I need to match hundreds of values, which is very tedious. I have tried replacing this with lists or vectors.
Ideally, I'd like the code to look more like this, if you know what I mean, but this doesn't work:
b <- ...  # list or vector with the character values of interest
z <- lapply(x, function(f) {
  a <- read.sas(f)
  subset(a, Cat == any(b))
})
y <- bind_rows(z)
The idea is that a row would be kept in the subset whenever Cat equals any value in the list b. However, I've only been able to get this to work by spelling out Cat == value for each value, joined with the or symbol.
Thanks!
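A hedged sketch of the usual idiom here: the %in% operator tests each Cat value for membership in an entire vector at once, so the long chain of == and | comparisons collapses to a single expression (this reuses read.sas and bind_rows from the code above):
b <- c("C21", "C98", "D27")  # or a vector of hundreds of codes
z <- lapply(x, function(f) {
  a <- read.sas(f)
  subset(a, Cat %in% b)      # keep rows whose Cat matches any value in b
})
y <- bind_rows(z)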
I want to apply a for loop on a non-specified range of rows in a tibble.
I have to modify the following code that applies a for loop on a specific number of rows in a tibble:
for(times in unique(dat$wname)[1:111]){...}
In this tibble the range 1:111 corresponds to a specific file; in fact, the value in the column "File" repeats 111 times. However, in my data I do not know how many times the same file repeats. For example, one file may repeat for 80 rows and another for 85. How do I tell the loop to look only at the range of rows in which the column File has the same name?
I need something to say:
for(times in unique(dat$wname)["for each row in the column File with the same name"]){...}
How can I do it?
You can count the number of rows of your dat variable using nrow() (or ncol() if you want columns), or take length(unique(dat$wname)), and do something like this:
rows <- nrow(dat)  # or
rows <- length(unique(dat$wname))
for (times in unique(dat$wname)[1:rows]) {...}
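If the aim is instead to work on each file's block of rows, a minimal sketch (assuming the grouping column really is named File):
for (f in unique(dat$File)) {
  rows <- which(dat$File == f)  # indices of the rows sharing this file name
  # ... work on dat[rows, ] here ...
}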
But a reproducible example would make things a lot easier to understand and answer.
I would like to assign names to rows in R, but so far I have only found ways to assign names to columns. My data is in two columns: the first column (geo) holds the name of the specific location I'm investigating, and the second column (skada) is the observed value at that location. To clarify, I want to assign a name to every location instead of just having them all in one .txt file, so that the data is easier to work with. Does anyone with more experience than me know how to handle this in R?
First you need to import the data into your global environment; try the function read.table().
To name the rows, try (assuming your data.frame is named df):
rownames(df) <- df[, "geo"]  # use the geo column as row names
df <- df[, -1]               # then drop that column
Well, your question is not that clear...
I assume you are trying to create a data.frame with named rows. If you look at the data.frame help, you can see the description of the row.names parameter:
NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.
which means you can manually specify the row names when you create the data.frame, or specify the column containing the names. The former can be achieved as follows:
d = data.frame(x=rnorm(10),            # 10 normally distributed random values
               y=rnorm(10),            # 10 normally distributed random values
               row.names=letters[1:10] # use the first 10 letters as row names
               )
while the latter is
d = data.frame(x=rnorm(10),     # 10 normally distributed random values
               y=rnorm(10),     # 10 normally distributed random values
               r=letters[1:10], # the first 10 letters
               row.names=3      # the row names come from the 3rd column
               )
If you are reading the data from a file, I will assume you are using the command read.table. Many of its parameters are the same as data.frame's; in particular you will find that the row.names parameter works the same way:
a vector of row names. This can be a vector giving the actual row names, or a single number giving the column of the table which contains the row names, or character string giving the name of the table column containing the row names.
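For example, a minimal sketch (the file name and the header assumption are hypothetical):
df <- read.table("skada.txt",       # hypothetical file with geo and skada columns
                 header = TRUE,     # assuming the file has a header row
                 row.names = "geo") # use the geo column as row names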
Finally, if you have already read in the data.frame and want to change the row names, Pierre's answer is your solution.
I'm trying to get the correlation coefficients for corresponding columns of two CSV files. I simply use the following but get errors. Consider that each CSV file has 50 columns.
first_values <- read.csv("")   # add appropriate path to the file
second_values <- read.csv("")  # add appropriate path to the file
correlation.csv <- cor(x = first_values, y = second_values, method = "spearman")
But I get the error 'x' must be numeric!
[screenshot: a subset of one CSV file]
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets, have the same number of rows, and do not have spaces in their names, it is as simple as:
mapply(cor, first_values, second_values)
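For concreteness, a toy check of that call with made-up numeric data:
first_values  <- data.frame(a = rnorm(10), b = rnorm(10))
second_values <- data.frame(a = rnorm(10), b = rnorm(10))
# returns a named vector: one Spearman coefficient per column pair
mapply(cor, first_values, second_values, MoreArgs = list(method = "spearman"))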
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
There must be some categorical (non-numeric) variable in x. You can first separate that categorical variable from x and then use x in the cor() function.
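A minimal sketch of that idea, assuming the first_values / second_values objects from above:
num1 <- first_values[sapply(first_values, is.numeric)]   # keep numeric columns only
num2 <- second_values[sapply(second_values, is.numeric)]
cor(x = num1, y = num2, method = "spearman")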
I have a huge dataframe of around 1M rows and want to split it based on one column into different ranges.
Example dataframe:
length <- sample(rep(1:400), 100)
var1 <- rnorm(1:100)
var2 <- sample(rep(letters[1:25], 4))
test <- data.frame(length, var1, var2)
I want to split the dataframe based on length at different ranges (ex: all rows for length between 1 and 50).
range_length<-list(1:50,51:100,101:150,151:200,201:250,251:300,301:350,351:400)
I can do this by subsetting from the dataframe, e.g. test1 <- test[test$length >= 1 & test$length <= 50, ]
But I am looking for a more efficient way using split() (just one line).
range <- seq(0, 400, 50)
split(test, cut(test$length, range))
But do heed Justin's suggestion and look into using data.table instead of data.frame. I'll also add that it's very unlikely you actually need to split the data.frame/table at all.
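If you do try data.table, a minimal sketch of the no-split approach (the per-range summary is just an illustration):
library(data.table)
setDT(test)                                  # convert test to a data.table in place
test[, bin := cut(length, seq(0, 400, 50))]  # label each row with its range
test[, .(mean_var1 = mean(var1)), by = bin]  # summarise by range without splitting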