Splitting a dataframe by columns - r

I want to split my data frame by columns. It sounds trivial, but I have not succeeded so far.
Here is what I have come up with:
SG <- data.frame(num = 1:26, let = letters, LET = LETTERS)
SG <- lapply(SG, function(x) split(x, colnames(SG)))
str(SG)
List of 3
$ num:List of 3
$ let:List of 3
$ LET:List of 3
This converts my data frame into a list of lists. But I would like to have a list of data frames, each one containing one column of the initial data frame and preserving the row name info from SG. Is that possible?
Thank you!

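This should work when applied to the original SG data frame; row names are preserved. It returns a list of data frames (note that the column inside each one is renamed, shown as X..i.. below):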
SG <- lapply(SG, data.frame)
str(SG)
List of 3
$ num:'data.frame': 26 obs. of 1 variable:
..$ X..i..: int [1:26] 1 2 3 4 5 6 7 8 9 10 ...
$ let:'data.frame': 26 obs. of 1 variable:
..$ X..i..: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ LET:'data.frame': 26 obs. of 1 variable:
..$ X..i..: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...

You can use
lapply(colnames(SG), function(x) SG[, x, drop = FALSE])
which returns an unnamed list with the structure
List of 3
$ :'data.frame': 26 obs. of 1 variable:
..$ num: int [1:26] 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 26 obs. of 1 variable:
..$ let: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 26 obs. of 1 variable:
..$ LET: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
In this case we are just subsetting. split() on a data.frame is the better tool when you want to separate the rows, not the columns, into different groups.
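A minimal sketch of the same subsetting idea that also keeps the column names as list names. The custom row names and the helper variables cols and SG_list are introduced here only for illustration:

```r
# Toy data frame from the question, with custom row names to show they survive.
SG <- data.frame(num = 1:26, let = letters, LET = LETTERS,
                 row.names = paste0("r", 1:26))

# Iterate over a named vector of column names so the result is a named list.
# drop = FALSE keeps each piece a one-column data frame instead of a vector.
cols <- setNames(names(SG), names(SG))
SG_list <- lapply(cols, function(nm) SG[, nm, drop = FALSE])

names(SG_list)
head(rownames(SG_list$let))
```

Each element can then be accessed by name, for example SG_list$num, and still carries the original column name and row names.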

Related

Rename variables for multiple dataframe, using a loop, referencing the dataframe names from a list

Difficult to articulate. Here is an example to explain.
I have 3 dataframes.
df1 <- data.frame(var1=c(1:5),var2=seq(1,10,by=1) )
df2 <- data.frame(var1=c(6:10),var2=seq(1,10,by=1) )
df3 <- data.frame(var1=c(11:15),var2=seq(1,10,by=1) )
I have a character vector with the names of those data frames
df_list <- c("df1","df2","df3")
I'm trying to rename all the variables within each of those data frames to "VALUE".
I can do it for each data frame with a line of code like this
names(df1)[1:ncol(df1)]<-paste("VALUE")
Sometimes I may have several dataframes. Rather than write hundreds of lines of the same code, I'd like to do this with a loop. I tried this but without luck.
for (i in 1:length(df_list)){
  names(get(df_list[i]))[1:ncol(get(df_list[i]))]<-paste("VALUE")
}
Is there a way of doing this with a loop? Any help is greatly appreciated. Thanks
Expected output would be VALUE as variable name for all variables in each dataframe
> df1
VALUE VALUE
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 1 6
7 2 7
8 3 8
9 4 9
10 5 10
We can fetch the objects named in df_list into a list with mget, loop over that list with lapply, and set every column name to 'VALUE' with setNames. This is not recommended at all, as data.frame column names should be unique. The renamed copies end up in lst1; the original df1, df2 and df3 are not modified.
lst1 <- lapply(mget(df_list), function(x) setNames(x, rep("VALUE", ncol(x))))
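If the goal is to overwrite the original objects, the renamed list can be written back with list2env. The following is only a sketch under a few assumptions: the data frames live in a dedicated environment called env (hypothetical, used to keep the example self-contained), the toy data is smaller than in the question, and the columns get unique names VALUE_1, VALUE_2 instead of a repeated "VALUE":

```r
# Hypothetical environment holding the data frames (stands in for the workspace).
env <- new.env()
env$df1 <- data.frame(var1 = 1:3, var2 = 4:6)
env$df2 <- data.frame(var1 = 7:9, var2 = 10:12)
df_names <- c("df1", "df2")

# Give every column a unique name: VALUE_1, VALUE_2, ...
rename_cols <- function(x) setNames(x, paste0("VALUE_", seq_along(x)))

# mget() returns a named list of the objects; lapply() renames each copy.
renamed <- lapply(mget(df_names, envir = env), rename_cols)

# list2env() assigns each list element back under its own name.
list2env(renamed, envir = env)

names(env$df1)
```

To work on the global workspace instead, envir = globalenv() could be used in both calls.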
There are probably good reasons not to name all your columns the same, but a nested loop works:
df1 <- data.frame(var1=c(1:5),var2=seq(1,10,by=1) )
df2 <- data.frame(var1=c(6:10),var2=seq(1,10,by=1) )
df3 <- data.frame(var1=c(11:15),var2=seq(1,10,by=1) )
df_list <- list(df1, df2, df3)
for (i in 1:length(df_list)) {
  for (j in 1:length(names(df_list[[i]]))) {
    names(df_list[[i]])[j] <- 'VALUE'
  }
}
str(df_list)
List of 3
$ :'data.frame': 10 obs. of 2 variables:
..$ VALUE: int [1:10] 1 2 3 4 5 1 2 3 4 5
..$ VALUE: num [1:10] 1 2 3 4 5 6 7 8 9 10
$ :'data.frame': 10 obs. of 2 variables:
..$ VALUE: int [1:10] 6 7 8 9 10 6 7 8 9 10
..$ VALUE: num [1:10] 1 2 3 4 5 6 7 8 9 10
$ :'data.frame': 10 obs. of 2 variables:
..$ VALUE: int [1:10] 11 12 13 14 15 11 12 13 14 15
..$ VALUE: num [1:10] 1 2 3 4 5 6 7 8 9 10
It can be done, but should it?

Why does a column change to a data.frame after using match in R?

One column changes to a data.frame after I use a match condition in R.
Can anyone tell me why the column becomes a data.frame after the update?
a<-data.frame(columnB=sample(1:20,20,replace = F),
columnC=sample(4:80,20,replace = F))
d<-data.frame(columnE=letters[1:20],
columnF=sample(1:20,20,replace = F))
a$columnB<-d[match(a$columnB,d$columnF),]
str(a)
Output :
> str(a)
'data.frame': 20 obs. of 2 variables:
$ columnB:'data.frame': 20 obs. of 2 variables:
..$ columnE: Factor w/ 20 levels "a","b","c","d",..: 18 8 1 19 16 3 4 15 17 5 ...
..$ columnF: int 6 20 12 11 13 1 7 19 14 8 ...
$ columnC: int 69 6 37 80 55 49 4 5 44 76 ...
1. How can I turn the data frame column back into a normal column?
2. Is there an easy way to match and update a column in one table based on table d?
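A likely explanation: d[rows, ] with an empty column index returns a data.frame containing every column of d, so assigning it to a$columnB nests a whole data frame inside a. Selecting a single column of d gives an ordinary vector instead. The sketch below assumes the intent is to replace columnB with the matching columnE values, and uses small fixed data rather than sample() so the result is reproducible:

```r
a <- data.frame(columnB = c(3L, 1L, 2L), columnC = c(10L, 20L, 30L))
d <- data.frame(columnE = c("x", "y", "z"), columnF = c(1L, 2L, 3L),
                stringsAsFactors = FALSE)

# Row positions in d whose columnF equals each value of a$columnB.
idx <- match(a$columnB, d$columnF)

# Pick ONE column of d, so the result is a vector, not a data.frame.
a$columnB <- d$columnE[idx]

str(a)
```

The same lookup could also be written with merge(), which joins the two tables on the key columns.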

Change type of variables in multiple data frames

I have a list of data frames:
str(df.list)
List of 34
$ :'data.frame': 506 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:506] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALAT": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:506] 23 23 23 24 25 24 20 34 28 17 ...
..$ Index : Factor w/ 502 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ...
$ :'data.frame': 505 obs. of 7 variables:
..$ Protocol : Factor w/ 5 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ Time : num [1:505] 0 2 3 0.5 6 1 24 24 24 24 ...
..$ SampleID : Factor w/ 40 levels "P1T0","P1T0.5",..: 1 5 7 2 8 3 6 6 6 6 ...
..$ VolunteerID: Factor w/ 15 levels "ID-02","ID-03",..: 10 10 10 10 10 10 10 11 13 14 ...
..$ Assay : Factor w/ 1 level "ALB": 1 1 1 1 1 1 1 1 1 1 ...
..$ ResultAssay: int [1:505] 45 46 47 47 49 47 46 46 44 43 ...
..$ Index : Factor w/ 501 levels "P1T0.5VID-02",..: 8 31 37 2 43 19 25 26 28 29 ..
The list contains 34 data frames with equal variable names. The variables Time and ResultAssay are of the wrong type: I would like to have Time as factor and ResultAssay as numerical.
I am trying to write a function to use together with lapply to convert the variable types in this list of 34 data frames in one go, but so far I have been unsuccessful.
I have tried something along the lines of:
ChangeType <- function(DF){
  DF[,2] <- as.factor(DF[,2])
  DF[, "ResultAssay"] <- as.numeric(DF[, c("ResultAssay")])
}
lapply(df.list, ChangeType)
What you have tried is nearly correct, but the function also needs to return the modified data.frame, and you need to store the result back in your existing variable, like so:
ChangeType <- function(DF){
  DF[,2] <- as.factor(DF[,2])
  DF[, "ResultAssay"] <- as.numeric(DF[, "ResultAssay"])
  DF # return the data.frame
}
# store the returned value in df.list,
# thus updating your existing list of data frames
df.list <- lapply(df.list, ChangeType)
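A small self-contained sketch of the same pattern, with a check that the types really changed. The two toy data frames are made up for illustration and only contain the two columns of interest; the columns are addressed by name rather than by position, which is an assumption about what is more robust for this data:

```r
# Toy stand-in for the list of 34 data frames.
df.list <- list(
  data.frame(Time = c(0, 2, 0.5), ResultAssay = c(23L, 24L, 20L)),
  data.frame(Time = c(0, 1, 24), ResultAssay = c(45L, 46L, 47L))
)

ChangeType <- function(DF) {
  DF$Time <- as.factor(DF$Time)                 # numeric -> factor
  DF$ResultAssay <- as.numeric(DF$ResultAssay)  # integer -> numeric (double)
  DF                                            # return the modified copy
}

df.list <- lapply(df.list, ChangeType)

# Inspect the resulting column classes for every data frame in the list.
sapply(df.list, function(DF) sapply(DF, class))
```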

Convert Factors in 2 Data Frames of a List into Numeric

I am having trouble converting the columns of 2 data frames in a list to numeric. Right now both data frames have 2 columns consisting of factors. I want to convert them to numeric so that I can do mathematical operations on them. Below is sample code:
library(XML)
bal <- "http://www.baseball-reference.com/teams/BAL/2014-schedule-scores.shtml"
bos <- "http://www.baseball-reference.com/teams/BOS/2014-schedule-scores.shtml"
mylist <- list(bal, bos)
a <- lapply(mylist, readHTMLTable)
b <- lapply(a, function(x) x[["team_schedule"]][, c("R", "RA")])
c <- as.numeric(as.character(b))
When I run this code I get:
> c
[1] NA NA
> str(c)
num [1:2] NA NA
Here is the structure of b:
> str(b)
List of 2
$ :'data.frame': 165 obs. of 2 variables:
..$ R : Factor w/ 13 levels "","0","10","11",..: 6 6 7 8 10 7 6 5 9 2 ...
..$ RA: Factor w/ 13 levels "","0","1","10",..: 3 9 7 4 10 3 7 8 7 6 ...
$ :'data.frame': 166 obs. of 2 variables:
..$ R : Factor w/ 10 levels "","0","1","2",..: 3 8 6 4 8 2 7 9 6 3 ...
..$ RA: Factor w/ 13 levels "","1","10","14",..: 5 5 6 9 10 7 2 3 5 7 ...
What should I do differently to convert the factors into numeric values?
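You need to use lapply. Run str on b:
str(b)
This shows that b is a list of 2 data.frames. as.numeric(as.character(b)) is applied to the list as a whole, not to the columns inside it, which is why it returns NA NA.
So you need lapply over the list together with sapply over the columns of each data frame:
lapply(b, function(x) sapply(x, function(x) as.numeric(as.character(x))))
Note that sapply returns a numeric matrix for each list element rather than a data.frame. Entries that are blank/empty, and any non-numeric entries such as D/N, will be converted to NA.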
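A variant that keeps each list element a data.frame is to assign into df[] inside the function. This is a sketch on made-up factor data (no download needed); to_num and b_num are names introduced here for illustration:

```r
# Toy stand-in for b: a list of data frames whose columns are factors.
b <- list(
  data.frame(R = factor(c("2", "10", "")), RA = factor(c("1", "", "7"))),
  data.frame(R = factor(c("0", "3")), RA = factor(c("14", "5")))
)

# Convert via character so the factor labels, not the level codes, are used.
to_num <- function(f) as.numeric(as.character(f))

b_num <- lapply(b, function(df) {
  df[] <- lapply(df, to_num)  # df[] keeps the data.frame structure
  df
})

str(b_num)
```

Blank entries become NA here, so functions such as sum() or mean() would need na.rm = TRUE.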

Treatment of 'empty' values

I am importing a csv file into R using the sqldf package. I have several missing values for both numeric and string variables. I notice that missing values are left empty in the data frame (as opposed to being filled with NA or something else). I want to replace the missing values with a user-defined value. Obviously, a function like is.na() will not work in this case, since the values are empty rather than NA.
Toy dataframe with three columns:
A B C
3 4
2 4 6
34 23 43
2 5
I want:
A B C
3 4 NA
2 4 6
34 23 43
2 5 NA
Thank you in advance.
Assuming you are using read.csv.sql in sqldf with the default SQLite database, it is producing a factor column for C, so:
(1) just convert the values to numeric using as.numeric(as.character(...)) like this:
> Lines <- "A,B,C
+ 3,4,
+ 2,4,6
+ 34,23,43
+ 2,5,
+ "
> cat(Lines, file = "stest.csv")
> library(sqldf)
> DF <- read.csv.sql("stest.csv")
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: Factor w/ 3 levels "","43","6": 1 3 2 1
> DF$C <- as.numeric(as.character(DF$C))
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: num NA 6 43 NA
(2) or if we use sqldf(..., method = "raw") then we can just use as.numeric:
> DF <- read.csv.sql("stest.csv", method = "raw")
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: chr "" "6" "43" ""
> DF$C <- as.numeric(DF$C)
> str(DF)
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: num NA 6 43 NA
(3) If it's feasible for you to use read.csv, then we get NA filling right away:
> str(read.csv("stest.csv"))
'data.frame': 4 obs. of 3 variables:
$ A: int 3 2 34 2
$ B: int 4 4 23 5
$ C: int NA 6 43 NA
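For string columns and for the user-defined replacement asked about in the question, one possible approach (assuming read.csv is an option) is to declare the empty string as a missing-value marker via na.strings and then fill the NAs. The string column S and the replacement values -1 and "missing" are made up for this sketch:

```r
# Same toy data as above plus a hypothetical string column S.
csv_text <- "A,B,C,S
3,4,,x
2,4,6,
34,23,43,y
2,5,,
"

# Treat empty fields as NA in every column, including character ones.
DF <- read.csv(text = csv_text, na.strings = c("", "NA"),
               stringsAsFactors = FALSE)

# Replace the missing values with user-defined values per column.
DF$C[is.na(DF$C)] <- -1
DF$S[is.na(DF$S)] <- "missing"

str(DF)
```

Once the empty fields are real NA values, is.na() works as usual.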

Resources