Why is as.data.frame doing this in R programming? - r

First of all, I should say that I am new to R programming. I was experimenting with some R code and ran into behaviour that I did not expect. I hope someone can help me figure it out.
I ran the following code to read data from a CSV file:
normData= read.csv("normData.csv");
and my normData looks like:
But when I ran the following code to form a data frame:
datExpr0 = as.data.frame(t(normData));
I get the following data:
Can someone please tell me where the extra row (V1, V2, V3, V4, V5, V6) is coming from?

Try using:
setNames(as.data.frame(t(normData[-1])), normData[[1]])
However, it might be better to use the row.names argument of read.csv (or read.table) to read your "X" column directly as the row names. Then you should be able to use as.data.frame(t(...)) directly.
Here's a small example to show what's happening:
Start with a data.frame with characters as the first column:
df <- data.frame(A = letters[1:3],
B = 1:3, C = 4:6)
df
#   A B C
# 1 a 1 4
# 2 b 2 5
# 3 c 3 6
When you transpose the entire thing, you also transpose that first column (thereby also creating a character matrix).
as.data.frame(t(df))
#   V1 V2 V3
# A  a  b  c
# B  1  2  3
# C  4  5  6
So, we drop the column first, and use the values from the column to replace the "V1", "V2"... names.
setNames(as.data.frame(t(df[-1])), df[[1]])
#   a b c
# B 1 2 3
# C 4 5 6
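The row.names route mentioned above can be sketched as follows. This is only an illustration: the CSV text, the sample names s1 and s2, and the gene labels are made up, and it assumes the first column of the file (called "X" here) holds the labels that should become row names.

```r
# Hypothetical CSV content standing in for normData.csv
csv_text <- "X,s1,s2\ngeneA,1,4\ngeneB,2,5\ngeneC,3,6"

# row.names = 1 turns the first column into row names instead of data
normData <- read.csv(text = csv_text, row.names = 1)

# Only numeric columns remain, so t() gives a numeric matrix and the
# former row names become the column names of the transposed data frame
datExpr0 <- as.data.frame(t(normData))
print(datExpr0)
```

Because the label column never enters the matrix, no V1, V2, ... names are generated and the values stay numeric.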

Related

Combine rows of data frame in R using colMeans?

I'm impressed by the number of "how to combine rows/columns" threads, but even more by the fact that none of these was particularly helpful or at least not applicable to my issue.
My data look like this:
MyData<-data.frame("id" = c("a","a","b"),
"value1_1990" = c(5,NA,1),
"value2_1990" = c(5,NA,2),
"value1_2000" = c(2,1,1),
"value2_2000" = c(2,1,2),
"value1_2010" = c(NA,9,1),
"value2_2010" = c(NA,9,2))
What I want to do is to combine the two rows where id=="a" for columns MyData[,(2:7)] using base R's colMeans.
What it looks like:
  id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1  a           5           5           2           2          NA          NA
2  a          NA          NA           1           1           9           9
3  b           1           2           1           2           1           2
What I need:
  id value1_1990 value2_1990 value1_2000 value2_2000 value1_2010 value2_2010
1  a           5           5         1.5         1.5           9           9
2  b           1           2           1           2           1           2
What I tried (among numerous other things):
MyData[nrow(MyData)+1, 2:7] = colMeans(MyData[which(MyData$id=="a"),(2:7)],na.rm=T) # to combine values from rows where id=="a"
MyData$id<-ifelse(is.na(MyData$id),"NewRow",MyData$id) # to replace "<NA>" in the id-column of the newly created row by "NewRow".
This works, except that:
...it turns all other existing ids into numeric values (I don't want the second line of code, the ifelse statement, to touch any of the existing ids, which is why I used MyData$id as the else value).
...this is not particularly elegant code. Is there a one-line solution that does the trick? I saw other approaches using aggregate(), but they didn't work for me.
You can try using dplyr:
library(dplyr)
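Possible solution (originally written with summarise_all(funs(...)); funs() has since been deprecated and summarise_all() superseded, so with dplyr 1.0 or later use across()):
MyData %>% group_by(id) %>% summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))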
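Since the question mentions aggregate(), here is a base R sketch of the same grouping, offered as one possible approach rather than the definitive one. It uses the by = interface of aggregate(), so rows containing NA are not dropped before mean() sees them; stringsAsFactors = FALSE is added to the question's data so that id stays a character column (the ids most likely turned numeric in the question because id was a factor and ifelse() returned its integer codes).

```r
MyData <- data.frame(id = c("a", "a", "b"),
                     value1_1990 = c(5, NA, 1),
                     value2_1990 = c(5, NA, 2),
                     value1_2000 = c(2, 1, 1),
                     value2_2000 = c(2, 1, 2),
                     value1_2010 = c(NA, 9, 1),
                     value2_2010 = c(NA, 9, 2),
                     stringsAsFactors = FALSE)

# Average every value column within each id; na.rm = TRUE is passed on to mean()
res <- aggregate(MyData[-1], by = list(id = MyData$id), FUN = mean, na.rm = TRUE)
print(res)
```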

View dataframes by pasting its name in r

Is there any way to View data frames in R while referring to them through another variable? Say I have 10 data frames named df1 to df10; is there a way to View them using i instead of 1:10?
Example:
df1 = as.data.frame(c(1:20))
i = 1
View(paste("df", i, sep =""))
I would like this last piece of code to do the same as View(df1). Is there any command or similar in R that allows you to do that?
The answer to your immediate question is get:
df1 <- data.frame(x = 1:5)
df2 <- data.frame(x = 6:10)
> get(paste0("df",1))
x
1 1
2 2
3 3
4 4
5 5
But having multiple similar objects with names like df1, df2, etc in your workspace is considered fairly bad practice in R, and instead experienced R folks will prefer to put related objects in a named list:
df_list <- setNames(list(df1,df2),paste0("df",1:2))
> df_list[[paste0("df",1)]]
x
1 1
2 2
3 3
4 4
5 5
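If the numbered data frames already exist in the workspace, mget() can collect them into such a named list in one step. The following is a small sketch with two made-up data frames; View() is left in a comment because it needs an interactive session.

```r
df1 <- data.frame(x = 1:5)
df2 <- data.frame(x = 6:10)

# mget() looks up several objects by name and returns them as a named list
df_list <- mget(paste0("df", 1:2))

for (i in seq_along(df_list)) {
  current <- df_list[[paste0("df", i)]]
  # In an interactive session this could be: View(current)
  print(head(current, 2))
}
```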

How to import two sets of data in the same excel sheet in R?

Currently in one excel sheet I have one block of data that begins from row 1 and the last row always varies, but it is usually around 18 or 19. Once the first set of data ends then there are two blank rows and the second data set begins, which is also around 18 or 19. The two data sets have the same number of columns and share the same headers. I save the excel sheet as a csv. Then in R I will do read.csv(), but after I have done that I do not know how to separate the two sets of data into separate data.frames.
I realize I could just copy and paste the second data set into a separate excel sheet and read it in, but I do not want to do that. I want to leave the excel sheet untouched.
Example of the excel sheet:
A B C D # FIRST DATA SET
1 2 3 4


A B C D # SECOND DATA SET
5 6 7 8
Any help will be appreciated and please let me know if more info is needed.
There are probably many ways to achieve what you want. One option is to read the file with readLines, find the indices of the two empty lines, and call read.csv on the two subsets:
txt <- readLines(con=textConnection("1,2,3,4
5,6,7,8


a,b,c,d,e
f,g,h,i,j"))
read.csv(header=FALSE, text=txt[1:which.max(txt=="")])
#   V1 V2 V3 V4
# 1  1  2  3  4
# 2  5  6  7  8
read.csv(header=FALSE, text=txt[(which.max(txt=="")+2):length(txt)])
#   V1 V2 V3 V4 V5
# 1  a  b  c  d  e
# 2  f  g  h  i  j
With regards to your added toy example:
txt <- readLines(con=textConnection("A B C D #1st
1 2 3 4


A B C D #2nd
5 6 7 8"))
txt <- sub("\\s+#.*$", "", txt) # delete comments if necessary
read.table(header=TRUE, check.names = FALSE, text=txt[1:which.max(txt=="")])
#   A B C D
# 1 1 2 3 4
read.table(header=TRUE, check.names = FALSE, text=txt[(which.max(txt=="")+2):length(txt)])
#   A B C D
# 1 5 6 7 8
That depends. If you know the row number where the first block ends, and the second block has no header, you can do
mydata <- read.csv('yourfile.csv', header=TRUE)
block1 <- mydata[1:18,]
block2 <- mydata[19:nrow(mydata),]
If your blocks have different structures, such as a different number of columns, and each block has its own column names, then it’s better to use the readLines() function and pass the result to read.csv. How do you tell those blocks apart?
In reply to your comment:
Then it’s relatively easy. As Kota Mori pointed out, read your data with the blank lines kept. Assuming your first column has numeric values, and no NAs except in between your data sets,
mydata <- read.csv('yourfile.csv', header=TRUE, blank.lines.skip = FALSE)
blines <- which(is.na(mydata[,1]))
data1 <- mydata[1:(blines[1]-1),]
data2 <- mydata[(blines[length(blines)]+1):nrow(mydata),]
You should alter the search pattern depending on your data.
This depends on what data file you have.
If you have two empty rows between the two data sets, setting blank.lines.skip = FALSE in read.csv() would allow you to locate where to split the data.
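The ideas above can be combined into a sketch that handles any number of blocks without knowing the row counts in advance. The file content below is invented for illustration; in practice txt would come from readLines("yourfile.csv"). It assumes every block repeats the header, and it treats lines consisting only of commas as blank, since a CSV export of empty spreadsheet rows may look like ",,," rather than "".

```r
# Hypothetical file content; replace with readLines("yourfile.csv")
txt <- c("A,B,C,D", "1,2,3,4", "2,3,4,5", "", "", "A,B,C,D", "5,6,7,8")

# A line counts as blank if it holds nothing but commas or whitespace
blank <- grepl("^[,[:space:]]*$", txt)

# cumsum(blank) increases at every blank line, so all lines of one block
# share the same value; use it as a grouping key for the non-blank lines
block_id <- cumsum(blank)[!blank]
blocks <- split(txt[!blank], block_id)

# Parse each block separately, each with its own header row
dfs <- lapply(blocks, function(lines) read.csv(text = lines, header = TRUE))
print(dfs)
```

The result is a list of data frames, one per block, so the spreadsheet itself never needs to be edited.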

How to change values in a column of a data frame based on conditions in another column?

I would like to have an equivalent of the Excel function "if". It seems basic enough, but I could not find relevant help.
I would like to assign "NA" to specific cells when two consecutive cells in a different column are not identical. In Excel, the formula would be the following (say in C1): if(A1 = A2, B1, "NA"). I would then just fill it down the rest of the column.
But in R, I am stuck!
Here is an equivalent of my R code so far.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"))
df
To get the following Type of each Type in another column, I found a useful function on StackOverflow that does the job.
# determines the following Type of each Type
shift <- function(x, n){
c(x[-(seq(n))], rep(6, n))
}
df$TypeFoll <- shift(df$Type, 1)
df
Now, I would like to keep TypeFoll in a specific row when the File for this row is identical to the File on the next row.
Here is what I tried. It failed!
for(i in 1:length(df$File)){
df$TypeFoll2 <- ifelse(df$File[i] == df$File[i+1], df$TypeFoll, "NA")
}
df
In the end, my data frame should look like:
aim = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"),
TypeFoll = c("2","3","4","4","5","6"),
TypeFoll2 = c("2","NA","4","4","NA","6"))
aim
Oh, and by the way, if someone would know how to easily put the columns TypeFoll and TypeFoll2 just after the column Type, it would be great!
Thanks in advance
I would do it as follows (not keeping the result from the shift function)
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"), stringsAsFactors = FALSE)
# This is your shift function
len=nrow(df)
A1 <- df$File[1:(len-1)]
A2 <- df$File[2:len]
# There is no need to store the result of the shift function in df
Then apply if(A1 = A2, B1, "NA"). As akrun mentioned, ifelse is vectorised. This is also how you append a column to a data.frame. Note that df$Type[2:len] is the Type of the following row:
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), 6) # why 6?
Since 6 is hard-coded here, something like
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), max(as.numeric(df$Type)) + 1)
is more generic (as.numeric() is needed because Type is stored as character).
First off, explicit 'for' loops are often slower and clumsier than vectorised code in R, so try to think of this as vector manipulation instead.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"));
Create shifted types and files and put it in new columns:
df$TypeFoll = c(as.character(df$Type[2:nrow(df)]), "NA");
df$FileFoll = c(as.character(df$File[2:nrow(df)]), "NA");
Now, df looks like this:
> df
  Type File TypeFoll FileFoll
1    1    A        2        A
2    2    A        3        B
3    3    B        4        B
4    4    B        4        B
5    4    B        5        C
6    5    C       NA       NA
Then, create TypeFoll2 by combining these:
df$TypeFoll2 = ifelse(df$File == df$FileFoll, df$TypeFoll, "NA");
And you should have something that looks a lot like what you want:
> df;
  Type File TypeFoll FileFoll TypeFoll2
1    1    A        2        A         2
2    2    A        3        B        NA
3    3    B        4        B         4
4    4    B        4        B         4
5    4    B        5        C        NA
6    5    C       NA       NA        NA
If you want to remove the FileFoll column:
df$FileFoll = NULL;
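Putting the pieces together, here is a sketch that reproduces the data frame the question asks for, including the column order. The "6" for the last row is the asker's own placeholder, and the helper names next_type, next_file and same_file are made up for this example.

```r
df <- data.frame(Type = c("1", "2", "3", "4", "4", "5"),
                 File = c("A", "A", "B", "B", "B", "C"),
                 stringsAsFactors = FALSE)
n <- nrow(df)

# Value of the following row; the last row has no successor
next_type <- c(df$Type[-1], "6")   # "6" is the placeholder used in the question
next_file <- c(df$File[-1], NA)

# TRUE where the next row belongs to the same File
same_file <- !is.na(next_file) & df$File == next_file

df$TypeFoll  <- next_type
# Keep the following Type when the File matches; the last row keeps its placeholder
df$TypeFoll2 <- ifelse(same_file | seq_len(n) == n, next_type, NA)

# Reorder columns by name so the new ones sit right after Type
df <- df[, c("Type", "TypeFoll", "TypeFoll2", "File")]
print(df)
```

The loop in the question failed because each iteration overwrote the whole TypeFoll2 column using a single comparison; comparing the two shifted vectors in one ifelse() call avoids that.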

R $ operator is invalid for atomic vectors

I have a dataset where one of the columns contains only the "#" sign. I used the following code to remove this column.
ia <- as.data.frame(sapply(ia,gsub,pattern="#",replacement=""))
However, after this operation, one of my integer columns had changed to factor.
I wonder what happened and how I can avoid that. I'd appreciate any help.
A more correct version of your code might be something like this:
d <- data.frame(x = as.character(1:5),y = c("a","b","#","c","d"))
> d[] <- lapply(d, gsub, pattern = "#", replacement = "")
> d
x y
1 1 a
2 2 b
3 3
4 4 c
5 5 d
But as you'll note, this approach will never actually remove the offending column. It's just replacing the # values with empty character strings. To remove a column of all # you might do something like this:
d <- data.frame(x = as.character(1:5),
y = c("a","b","#","c","d"),
z = rep("#",5))
> d[,!sapply(d,function(x) all(x == "#"))]
x y
1 1 a
2 2 b
3 3 #
4 4 c
5 5 d
Surely if you want to remove an offending column from a data frame, and you know which column it is, you can just subset. So, if it's the first column:
df <- df[,-1]
If it's a later column, use that column's index instead.
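As for why the integer column changed type: sapply() with gsub() returns a character matrix, so every column becomes text, and as.data.frame() then converted text to factors by default in R versions before 4.0. A sketch that drops the all-"#" column without touching the other columns' types might look like this (the data and the names is_hash and d_clean are invented for illustration):

```r
d <- data.frame(x = 1:5,
                y = c("a", "b", "#", "c", "d"),
                z = rep("#", 5),
                stringsAsFactors = FALSE)

# Flag columns whose every value is "#"; vapply guarantees a logical vector
is_hash <- vapply(d, function(col) all(col == "#"), logical(1))

# drop = FALSE keeps the result a data frame even if one column remains
d_clean <- d[, !is_hash, drop = FALSE]
str(d_clean)
```

Because the remaining columns are only subset, never passed through gsub(), x is still an integer column afterwards.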
