Using loop variables - r

I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the raw data is stored, I cannot reach a specific question's column by position with data[[152]] (some questions are filtered out of the data entirely because they are long-answer comments, so positions and question numbers do not line up), but I'd like to be able to access columns as data$152. Additionally, approximately half the column names in my data have loaded such that class(data$152) is NULL but class(data[[152]]) is integer (and if I rename the data[[152]] column, class(data$152) correctly shows as integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma-delimited .csv file, with the value 99 assigned to NA answers and the first row holding the column names/headers:
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column

You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
  a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
  1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
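For completeness, the loop from the question also works once the quotes around n are dropped, since "n" is the literal string rather than the loop counter. A minimal sketch, assuming the same toy dat as above:

```r
dat <- data.frame(a = 1:4, b = 2:5, c = 3:6, d = 4:7)

# Use the loop counter itself (not the string "n") as the new name.
# seq_along(dat) iterates over the column positions 1..ncol(dat).
for (n in seq_along(dat)) {
  names(dat)[n] <- as.character(n)
}

names(dat)
```

The vectorised assignment shown above does the same thing in one step, so the loop is mainly useful if each name needs extra per-column logic.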
EDIT: How to access your new data
@Ramnath points out, quite correctly, that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
  x1 x2 x3 x4
1  1  2  3  4
2  2  3  4  5
3  3  4  5  6
4  4  5  6  7
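Since the question mentions that some questions are filtered out, renumbering by position may not match the original question numbers. One possible alternative, sketched here under the assumption that every header has the form Q followed by digits (the toy columns Q001, Q002 and Q005 are made up for illustration), is to strip the prefix instead:

```r
# Toy data: question 3 and 4 are missing, as if they had been filtered out
dat <- data.frame(Q001 = 1:3, Q002 = 4:6, Q005 = 7:9)

# Drop the leading "Q" and the zero padding: "Q005" -> 5
question_numbers <- as.integer(sub("^Q", "", names(dat)))

# Column names are always stored as character, so 5 becomes "5"
names(dat) <- question_numbers

dat[["5"]]   # the column for question 5, by name
dat$`5`      # the same column, using backticks with $
```

Note that dat[[5]] with a bare number would still index by position, so the name has to be passed as a string.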

Related

Why does dropping which(FALSE) columns delete all columns?

This answer warns of some scary behavior from which. Specifically, if you take any data frame, say df <- data.frame(x=1:5, y=2:6), and then try to subset it with something that evaluates to which(FALSE) (i.e. integer(0)), then you will delete every column in the data set. Why is this? Why would dropping all columns that correspond to integer(0) delete everything? Deleting nothing shouldn't destroy everything.
Example:
>df <- data.frame(x=1:5, y=2:6)
>df
  x y
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
>df <- df[,-which(FALSE)]
>df
data frame with 0 columns and 5 rows
Consider:
identical(integer(0), -integer(0))
# [1] TRUE
So, actually you're selecting nothing, rather than deleting nothing.
If you want to delete nothing, you could use a large negative integer, e.g. the largest possible.
df[, -.Machine$integer.max]
# x y
# 1 1 2
# 2 2 3
# 3 3 4
# 4 4 5
# 5 5 6
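A more robust way to drop a possibly empty set of columns is to select the complement by position rather than negating the index. A minimal sketch, assuming the same df as in the question (drop_idx, none, kept and no_y are illustrative names):

```r
df <- data.frame(x = 1:5, y = 2:6)

drop_idx <- which(FALSE)   # integer(0): nothing to drop

# Negating an empty index is still an empty index, so this selects no columns
none <- df[, -drop_idx, drop = FALSE]

# Selecting the complement keeps every column when drop_idx is empty
kept <- df[, setdiff(seq_along(df), drop_idx), drop = FALSE]

# The same idiom still drops columns when the index is non-empty
no_y <- df[, setdiff(seq_along(df), which(names(df) == "y")), drop = FALSE]
```

Unlike the large-negative-integer trick, this does not rely on the index being out of range.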

Extracting a specific type columns and specific named columns from a data frame-R

Say I have a data frame where some columns are of factor type and there is a column named "index" which is not a factor. I want to extract the columns
which are of factor type and
the "index" column.
For example let
df<-data.frame(a=runif(10),b=as.factor(sample(10)),index=as.numeric(1:10))
So df is:
a b index
0.16187501 5 1
0.75214741 8 2
0.08741729 3 3
0.58871514 2 4
0.18464752 9 5
0.98392420 1 6
0.73771960 10 7
0.97141474 6 8
0.15768011 7 9
0.10171931 4 10
Desired output is(let it be a data frame called df1)
df1:
b index
5 1
8 2
3 3
2 4
9 5
1 6
10 7
6 8
7 9
4 10
which consists of the factor column and the column named "index".
I use the following code:
vars<-apply(df,2,function(x) {(is.factor(x)) || (names(x)=="index")})
df1<-df[,vars]
However, this code does not work. How can I return df1 using an apply-type function in R? I will be very glad for any help. Thanks a lot.
You could do:
df[ , sapply(df, is.factor) | grepl("index", names(df))]
I think two things went wrong with your method: First, apply converts the data frame to a matrix, which doesn't store values as factors. Also, in a matrix, every value has to be of the same mode (character, numeric, etc.). In this case, everything gets coerced to character, so there's no factor to find.
Second, the column name isn't accessible within apply (AFAIK), so names(x) returns NULL and names(x)=="index" returns logical(0).
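Note that grepl("index", names(df)) also matches any column whose name merely contains "index". A minimal sketch with an exact name match instead, assuming the df from the question (set.seed and the name keep are additions for illustration):

```r
set.seed(1)  # only so the toy data is reproducible
df <- data.frame(a = runif(10),
                 b = as.factor(sample(10)),
                 index = as.numeric(1:10))

# sapply() works column by column on the data frame itself, so factors
# are still factors; names(df) supplies the column names separately
keep <- sapply(df, is.factor) | names(df) == "index"

df1 <- df[, keep, drop = FALSE]
names(df1)
```

drop = FALSE keeps the result a data frame even if only one column is selected.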

Finding the top values in data frame using r

How can I find the 5 highest values of a column in a data frame?
I tried the order() function but it gives me only the indices of the rows, whereas I need the actual data from the column. Here's what I have so far:
tail(order(DF$column, decreasing=TRUE),5)
You need to pass the result of order back to DF:
DF <- data.frame( column = 1:10,
names = letters[1:10])
order(DF$column)
# 1 2 3 4 5 6 7 8 9 10
head(DF[order(DF$column),],5)
# column names
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
You're correct that order just gives the indices. You then need to pass those indices to the data frame, to pick out the rows at those indices.
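Also, note that the example above sorts in ascending order, so head() returns the five lowest values. For the five highest, either use tail() on the ascending order or, as mentioned in the comments, head() with decreasing = TRUE; which one you pick is a matter of taste.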
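A minimal sketch of getting the five highest values, assuming the same toy DF as above (top5 and top5_values are illustrative names):

```r
DF <- data.frame(column = 1:10,
                 names = letters[1:10])

# Whole rows holding the five largest values of DF$column, largest first
top5 <- head(DF[order(DF$column, decreasing = TRUE), ], 5)

# If only the values are needed, sort() returns them directly
top5_values <- head(sort(DF$column, decreasing = TRUE), 5)
```

order() is the better fit when the other columns of the matching rows are needed; sort() is enough for the values alone.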

R: How can I remove rows from all the data frames in this list?

Say I have some data created like this
n <- 3
K <- 4
dat <- expand.grid(var1=1:n, var2=1:K)
dat looks like this:
   var1 var2
1     1    1
2     2    1
3     3    1
4     1    2
5     2    2
6     3    2
7     1    3
8     2    3
9     3    3
10    1    4
11    2    4
12    3    4
I want to remove some rows from both data frames in the list at the same time. Let's say I want to remove the 11th row, and I want the 'gap' to be filled in, so that now the 12th row will become the 11th row.
I understand this is a list of two data frames. Thus the advice here does not apply, since dat[[11]]<-NULL would do nothing, while dat[[2]]<-NULL would remove the second data frame from the list
lapply(dat,"[",11) lets me access the relevant elements, but I don't know how to remove them.
Assuming that we want to remove rows from a list of data.frames, we loop the list elements using lapply and remove the rows using numeric index.
lapply(lst, function(x) x[-11,])
Or without the anonymous function
lapply(lst, `[`, -11,)
Note that 'dat' itself is a single data.frame, not a list of data frames:
is.data.frame(dat)
#[1] TRUE
If we want to remove rows from 'dat',
dat[-11,]
If the row.names also needs to be changed
`row.names<-`(dat[-11,], NULL)
Data used for the list examples above:
lst <- list(dat, dat)
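Putting the pieces together, a minimal self-contained sketch that removes row 11 from every data frame in the list and renumbers the rows so the gap is closed (trimmed is an illustrative name):

```r
dat <- expand.grid(var1 = 1:3, var2 = 1:4)
lst <- list(dat, dat)

trimmed <- lapply(lst, function(x) {
  out <- x[-11, , drop = FALSE]  # drop the 11th row
  rownames(out) <- NULL          # renumber rows 1..11 so old row 12 becomes 11
  out
})

trimmed[[1]]
```

lapply returns a new list and leaves lst untouched, so the result has to be assigned if the change should persist.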

Import multiple data frames CSV - column separation

I have a csv file with multiple data frames that are all separated by a column (So 4 columns of data, empty column, 4 columns of data, etc.). Is there a nice way to read in the file and have R create a separate df for each of those contiguous sets of columns? Then I would be able to use lapply across all of these dfs.
Thanks for your help.
Read in the whole csv file, then use lapply to separately capture each four-column data frame into a list. Then use rbind to stack all the data frames into a single data frame.
dat = read.csv("YourFile.csv")
# Set this based on how many separate data frames are in your csv file
num.df = ceiling(ncol(dat)/5) # ncol(dat)/5 per @zx8754's comment; ceiling() also handles a file without a trailing empty column
# This will tell the function the column numbers where
# each data frame starts
start.cols = seq(1, 1 + 5*(num.df-1), 5)
df.list = lapply(start.cols, function(x) {
# Capture the next 4 columns
df = dat[, x:(x+3)]
# Use whatever names are appropriate here. This is just
# to make sure all of the data frames have the same column names
# so that rbind won't throw an error
names(df) = c(paste0("col", 1:4))
return(df)
})
# rbind all the data frames into a single data frame
df = do.call(rbind, df.list)
You can take advantage of colClasses:
Example data:
h1 h2 h3 h1.1 h2.1 h3.1 h1.2 h2.2 h3.2
1 1 6 3 1 8 8 1 5 2
2 2 1 1 6 5 8 1 3 1
3 3 2 6 1 2 3 1 2 5
Then you can loop through the number of data frames you want and read the file:
ngroups <- 3 #number of dataframes to read
datacols <- 3 #number of columns to read
fulldata <- list()
for (i in 1:ngroups) {
nskip <- (datacols+1)*(i-1)
cols.to.read <- c(rep("NULL", nskip), rep(NA, datacols), rep("NULL", (datacols+1)*(ngroups-i+1)-1)) #creates a list of NULLs and NAs. NULLs = don't read, NA = read
fulldata[[i]] <- read.csv("test.csv", colClasses=cols.to.read)
}
Result:
fulldata
[[1]]
h1 h2 h3
1 1 6 3
2 2 1 1
3 3 2 6
[[2]]
h1.1 h2.1 h3.1
1 1 8 8
2 6 5 8
3 1 2 3
[[3]]
h1.2 h2.2 h3.2
1 1 5 2
2 1 3 1
3 1 2 5
This works, but I believe the answers reading the file only once would be faster, since reading the same file over and over again doesn't sound like the optimal procedure.
First read in all your data into one large dataframe:
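maindf <- read.csv(yourfile)
Let's say n is the number of data frames inside your csv file:
for (i in 0:(n-1)) {
assign(paste0("df", i+1), maindf[, (1+5*i):(4+5*i)])
}
The result should be n data frames that can be accessed as df1, df2, ..., dfn. The parentheses in 0:(n-1) matter, since 0:n-1 is parsed as (0:n)-1 and would start at -1; the stride of 5 skips the empty separator column after each block of 4 columns.
I didn't test it, because no sample data was provided.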
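If the block width is not known in advance, the empty separator columns themselves can be used to find the blocks. A minimal sketch, assuming separator columns are completely empty and using a small inline CSV (csv_text, wide, is_sep, block_id and pieces are all illustrative names):

```r
# Two blocks of two columns, separated by one empty column
csv_text <- "a,b,,c,d
1,2,,3,4
5,6,,7,8"

# Treat empty fields as NA so the separator column is entirely NA
wide <- read.csv(text = csv_text, header = TRUE, na.strings = c("NA", ""))

# TRUE for columns that contain nothing but NA
is_sep <- vapply(wide, function(col) all(is.na(col)), logical(1))

# Each separator bumps the block counter; keep the counter for data columns only
block_id <- cumsum(is_sep)[!is_sep]

# One data frame per contiguous block of data columns
pieces <- lapply(split(which(!is_sep), block_id),
                 function(idx) wide[, idx, drop = FALSE])
```

This reads the file once and returns a list, which can be passed straight to lapply. A genuine data column that happens to be entirely missing would be mistaken for a separator, so the assumption should be checked against the real file.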
