So I have two different datasets and I am trying to check if a column name has a duplicate column name in another data set. For example:
V1 V2 V3
1 2 3
as one data set and
V4 V6 V1 V2
NA NA NA NA
And I am trying to make it so the second data set is like this
V4 V6 V1 V2
NA NA 1 NA
where only the minimum value in the original data set copies over, if that makes sense. I have tried using this function:
if(ncol((Session1t[grep(temp1, names(Session1t))])) != 0)
But this is not working. It returns the same value regardless of the input. After entering the if statement I then copy over only the column that I want, and I have that part figured out; I just cannot get the if statement to work correctly.
We can use ifelse and %in% to match column names and replace NA with 1.
# Create example data frame D1
D1 <- read.table(text = "V1 V2 V3
1 2 3",
header = TRUE)
# Create example data frame D2
D2 <- read.table(text = "V4 V6 V1 V2
NA NA NA NA",
header = TRUE)
# Replace NA to 1 if column names match
D2[1, ] <- ifelse(names(D2) %in% names(D1), 1, NA)
D2
# V4 V6 V1 V2
# 1 NA NA 1 1
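Another option is intersect, which finds the column names the two data sets share so that those columns can be copied across (here df1 and df2 stand for the two data sets):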
nm1 <- intersect(names(df1), names(df2))
df2[nm1] <- df1[nm1]
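The question asks for only the minimum value to be copied, which neither answer above does. A minimal sketch of one way to read that requirement follows. It assumes "minimum" means the smallest value among the shared columns in the first row of D1; the names shared, vals and min_col are made up for illustration.

```r
# Example data, as in the question
D1 <- data.frame(V1 = 1, V2 = 2, V3 = 3)
D2 <- data.frame(V4 = NA, V6 = NA, V1 = NA, V2 = NA)

# Column names present in both data frames
shared <- intersect(names(D1), names(D2))

if (length(shared) > 0) {
  # Named vector of the shared values in row 1 of D1
  vals <- unlist(D1[1, shared, drop = FALSE])
  # Assumption: copy only the column that holds the smallest value
  min_col <- names(vals)[which.min(vals)]
  D2[1, min_col] <- vals[[min_col]]
}

D2
```

The length(shared) > 0 check also works as a plain test for whether any column name is duplicated across the two data sets; for a single name, something like "V1" %in% names(D2) does the same job.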
I have a tab-delimited file that looks like this:
"ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
I use this code to read in the data:
df <- read.table("path/to/file",header=TRUE,fill=TRUE)
The result is this:
df
id V1 V2 V3 V4 V5
1 1 A 1 NA NA NA
2 2 B 2 NA NA NA
But I expect this:
df
id V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
I've tried sep="\t" and na.strings=c(""," ",NULL) but those don't help.
I can't get it to work with read.table, so how about parsing the string manually:
ss <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
library(tidyverse)
entries <- unlist(str_split(ss, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
str_remove("\\n") %>%
matrix(ncol = ncol, byrow = TRUE, dimnames = list(NULL, .[1:ncol])) %>%
as.data.frame() %>%
slice(-1) %>%
mutate_if(is.factor, as.character) %>%
mutate_all(parse_guess)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
Explanation: We split the string on "\t"; the first occurrence of "\n" tells us how many columns we have. We then tidy up the entries by removing the line break characters "\n", reshape as matrix and then as data.frame, fix the header, and let readr::parse_guess guess the data type of every column.
For good measure we can roll everything into a function
read.my.data <- function(s) {
entries <- unlist(str_split(s, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
str_remove("\\n") %>%
matrix(ncol = ncol, byrow = TRUE, dimnames = list(NULL, .[1:ncol])) %>%
as.data.frame() %>%
slice(-1) %>%
mutate_if(is.factor, as.character) %>%
mutate_all(parse_guess)
}
and confirm
read.my.data(ss)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
data.table's fread() had no problem reading in the string, but your data seems to have one \t too many (after each \n), which causes the creation of an extra column.
It is probably best practice to fix this in the export that creates your files.
If this is not possible, you can adjust fread()'s arguments to get the desired output.
Here we use drop to delete the first column that was created due to the extra \t.
To get the right column names back, we read the first line of the file again:
string <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
data.table::fread(string,
drop = 1,
fill = TRUE,
col.names = as.matrix(data.table::fread(string, nrows = 1, header = FALSE))[1, ])
ID V1 V2 V3 V4 V5
1: 1 A NA NA NA 1
2: 2 B NA NA NA 2
As Quar already mentioned in his/her comment, your file has an extra tab in the beginning of every line, so the number of column labels does not match the number of data fields:
> foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
> cat(foo, "\n")
ID V1 V2 V3 V4 V5
1 A 1
2 B 2
That would be ok if the additional first column contained unique row names.
So there are two ways to address the problem: 1. remove the empty column (ideally by fixing the process that produced that file) or 2. fix the row name issue.
Here is my suggestion using the second option:
As the data is tab separated, I'd use read.delim, which is just read.table with reasonable defaults for this kind of file. Of course that throws an error when used without some tweaking ("duplicate 'row.names' are not allowed"). To fix that, we need to tell it to use automatic row numbering. That way you get almost exactly what you want:
> read.delim(text=foo, row.names=NULL)
row.names ID V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
All that's left to do is get rid of the row.names column. Alternatively, you may want the ID column to be turned into row.names:
> read.delim(text=foo, row.names='ID')
row.names V1 V2 V3 V4 V5
1 A NA NA NA 1
2 B NA NA NA 2
Hope that helps.
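The first option mentioned above (removing the empty leading column) can also be done in base R before parsing. This is only a sketch, assuming every data line starts with exactly one surplus tab as in the example; for a real file, readLines("path/to/file") would take the place of the strsplit call.

```r
foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"

# Split the text into lines (for a file: lines <- readLines("path/to/file"))
lines <- strsplit(foo, "\n", fixed = TRUE)[[1]]

# Drop one leading tab where present; the header line has none and is left alone
lines <- sub("^\t", "", lines)

# Header and data rows now have the same number of fields
df <- read.delim(text = lines, na.strings = "")
df
```

Once the surplus tab is gone, header and data line up, so there is no row.names column to clean up afterwards.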
I have a data set with 399 rows and 7 columns. Each row contains some values and some NAs. What I want to do is create a new data frame holding, for each row, the standard deviation of every possible combination of 3 of that row's elements. For example, if row one has 4 elements, then row one of the new data frame should have 4 columns containing the standard deviations of all the combinations of 3 elements of row 1 (of the original data set).
This is the head of the original Data Set:
V1 V2 V3 V4 V5 V6 V7
1 0.0853146 0.0809561 0.1350686 NA NA NA NA
2 0.0788104 0.0964276 0.1222457 0.0853146 NA NA NA
3 0.1086917 0.0818920 0.0479148 0.0981603 0.0788104 NA NA
4 0.0811772 0.1088340 0.1823510 0.0809561 0.0964276 0.1086917 NA
5 0.1015970 0.1089944 0.1243186 0.0858065 0.0842896 0.0818920 0.0811772
6 0.0639869 0.1496792 0.1704337 0.1088340 0.1015970 NA NA
7 0.0619823 0.0962283 0.1089944 0.0639869 NA NA NA
The problem is that I can't remove the NAs, so I get the wrong number of combinations and therefore the wrong number of standard deviations.
Here is what I came up with, but it does not work.
mydf<-as.matrix(df, na.rm=TRUE)
row<-apply(mydf, na.rm=TRUE, MARGIN = 1, FUN =combn, m=3, simplify = TRUE)
row<-as.matrix((row))
stdeviation<-apply(row,MARGIN = 1, FUN=sd,na.rm=TRUE)
stdeviation<-as.data.frame(stdeviation)
The table of the combinations looks like this for row 2:
V1 V2 V3
0.0788104313282292 0.0964276223058486 0.122245745410429
0.0788104313282292 0.0964276223058486 0.0853146853146852
0.0788104313282292 0.122245745410429 0.0853146853146852
0.0964276223058486 0.122245745410429 0.0853146853146852
The output for the second row, which I managed to produce, looks like
V1 V2 V3 V4
stdeviation 0.02184631 0.008908499 0.02342661 0.01894719
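One possible approach, sketched here on the first two rows of the data shown above, is to drop the NAs row by row before calling combn, and to let combn apply sd to each combination through its FUN argument. The names sds, max_n and out are made up for illustration, and padding short rows with NA is an assumption about the desired layout.

```r
# First two rows of the example data
df <- data.frame(
  V1 = c(0.0853146, 0.0788104),
  V2 = c(0.0809561, 0.0964276),
  V3 = c(0.1350686, 0.1222457),
  V4 = c(NA, 0.0853146)
)

# For each row: remove NAs, then take the sd of every 3-element combination
sds <- lapply(seq_len(nrow(df)), function(i) {
  x <- unlist(df[i, ])
  x <- x[!is.na(x)]
  if (length(x) < 3) return(NA_real_)  # not enough values for one combination
  combn(x, 3, FUN = sd)
})

# Rows yield different numbers of combinations, so pad with NA
# to get a rectangular data frame (assumed layout)
max_n <- max(lengths(sds))
out <- as.data.frame(do.call(rbind, lapply(sds, function(s) {
  c(s, rep(NA_real_, max_n - length(s)))
})))
out
```

Row 1 has three values and therefore a single combination, while row 2 has four values and four combinations, which matches the four standard deviations shown above.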
I have the following question:
I have a list (L1) with two parts, each a data frame with the same 4 variables.
Variable 4 also holds the name of that part of the list, e.g. $a = a
a <- data.frame(V1=c("a","b","c"), V2=c(4,7,9), V3=1:3, V4=c("a","a","a"))
b <- data.frame(V1=c("d","e","f"), V2=c(10,14,16), V3=1:3, V4=c("b","b","b"))
L1 <- list(a=a, b=b)
L1
$a
V1 V2 V3 V4
a 4 1 a
b 7 2 a
c 9 3 a
$b
V1 V2 V3 V4
d 10 1 b
e 14 2 b
f 16 3 b
I would like to extract the rows of each part of the list with V3==2. If there is no row in the list with this value V1 to V3 should be extracted with NA and V4 should contain the name of the part of the list.
In the example the outcome should look like this:
V1 V2 V3 V4
b 7 2 a
e 14 2 b
If I select a value e.g. V3==4 then my result should look like this:
V1 V2 V3 V4
<NA> <NA> <NA> a
<NA> <NA> <NA> b
I can extract a column with
unlist(lapply(L1, "[",3)) but I can't figure out how to extract rows which have a certain value in a variable.
I also tried to combine lapply with the subset function, but this didn't work for me.
Thanks for your help!
This should work. The first command returns a list, the second one converts it to a data frame. If the value is not in the data, it returns NA (for the list) or a row of NAs (for the df).
l <- lapply(L1, function(x) {i <- which(x$V3 == 2)
if (length(i) > 0) x[i, ]
else NA })
df <- rbind(l[[1]], l[[2]])
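Note that the question also asks for V4 to keep the name of the list part when nothing matches. A base R variant of the same lapply idea that does this might look like the sketch below; pick_rows is a made-up helper name, and stringsAsFactors = FALSE is set only so that the result compares cleanly as character.

```r
a <- data.frame(V1 = c("a", "b", "c"), V2 = c(4, 7, 9), V3 = 1:3,
                V4 = "a", stringsAsFactors = FALSE)
b <- data.frame(V1 = c("d", "e", "f"), V2 = c(10, 14, 16), V3 = 1:3,
                V4 = "b", stringsAsFactors = FALSE)
L1 <- list(a = a, b = b)

# Hypothetical helper: rows with V3 == val from each list element,
# or a single NA row carrying the element's name in V4
pick_rows <- function(lst, val) {
  parts <- Map(function(x, nm) {
    hit <- x[x$V3 == val, , drop = FALSE]
    if (nrow(hit) == 0) {
      hit <- x[NA_integer_, , drop = FALSE]  # one all-NA row, column types kept
      hit$V4 <- nm
    }
    hit
  }, lst, names(lst))
  out <- do.call(rbind, parts)
  rownames(out) <- NULL
  out
}

pick_rows(L1, 2)
pick_rows(L1, 4)
```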
We could create a function using data.table. We rbind the list elements with rbindlist and, grouped by 'V4', return a row of NAs (.SD[.N+1]) if no 'V3' equals the given value, or else the matching Subset of Data.table (.SD[tmp]).
library(data.table)
f1 <- function(lst, val){
rbindlist(lst)[, {tmp <- V3==val
if(!any(tmp)) .SD[.N+1]
else .SD[tmp]},
by = V4][, names(lst[[1]]), with=FALSE]
}
f1(L1, 4)
# V1 V2 V3 V4
#1: NA NA NA a
#2: NA NA NA b
f1(L1, 3)
# V1 V2 V3 V4
#1: c 9 3 a
#2: f 16 3 b
f1(L1, 2)
# V1 V2 V3 V4
#1: b 7 2 a
#2: e 14 2 b
You can also use bind_rows from dplyr and then filter on V3:
library(dplyr)
list(a = a, b = b) %>%
bind_rows(.id = "source") %>%
filter(V3 == 2)
I want to transpose the output given by the last command and write it to a data.frame with 2 columns: the first column should hold the column names and the second the data type of each column, one row per column. How could I achieve this? I tried a variety of things but didn't get what I am looking for.
smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
smoke <- as.data.frame(smoke)
table1=sapply (smoke, class)
table1
You could also skip the table1 part and go straight from smoke to the desired result.
> data.frame(nm = names(smoke), cl = sapply(unname(smoke), class))
# nm cl
# 1 V1 numeric
# 2 V2 numeric
# 3 V3 numeric
You could try this:
data.frame(var.name = names(table1), var.class = table1, row.names=NULL)
# var.name var.class
#1 V1 numeric
#2 V2 numeric
#3 V3 numeric
You might be looking for the melt command.
library(reshape2)
smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
smoke <- as.data.frame(smoke)
table1 <- sapply (smoke, class)
smoke.melt <- melt(smoke)
levels(smoke.melt$variable) <- table1
> smoke.melt
variable value
1 numeric 51
2 numeric 92
3 numeric 68
4 numeric 43
5 numeric 28
6 numeric 22
7 numeric 22
8 numeric 21
9 numeric 9
Just convert table1 to data.frame and adjust:
dd = data.frame(table1)
dd
table1
V1 numeric
V2 numeric
V3 numeric
dd$VarName = rownames(dd)
dd
table1 VarName
V1 numeric V1
V2 numeric V2
V3 numeric V3
dd = dd[,c(2,1)]
dd
VarName table1
V1 V1 numeric
V2 V2 numeric
V3 V3 numeric
names(dd)[2] = "type"
dd
VarName type
V1 V1 numeric
V2 V2 numeric
V3 V3 numeric
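One caveat with the approaches above: class() can return more than one value for a column (a POSIXct column, for example), in which case sapply(smoke, class) returns a list rather than a character vector. A sketch that collapses multiple classes into one string per column is shown below; the extra when column and the name col_types are made up for illustration.

```r
smoke <- as.data.frame(matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9),
                              ncol = 3, byrow = TRUE))

# Hypothetical extra column whose class vector has two entries
smoke$when <- as.POSIXct("2020-01-01", tz = "UTC")

# vapply guarantees exactly one string per column
col_types <- data.frame(
  name = names(smoke),
  type = vapply(smoke, function(x) paste(class(x), collapse = "/"), character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
col_types
```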
I have a data frame with three variables and 250K records. As an example consider
df <- data.frame(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
V1 V2 V3
1 a 2
2 a 3
4 b 1
and want to swap values between V1 and V3 based on the value of V2 as follows:
if V2 == 'b' then V1 <- V3 and V3 <- V1
resulting in
V1 V2 V3
1 a 2
2 a 3
1 b 4
I tried a loop but it takes forever. If I use Perl, it takes seconds. I believe this task can be done efficiently in R as well. Any suggestions are appreciated.
Try this
df <- data.frame(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
df[df$V2 == "b", c("V1", "V3")] <- df[df$V2 == "b", c("V3", "V1")]
which yields:
> df
V1 V2 V3
1 1 a 2
2 2 a 3
3 1 b 4
You can use transform to do this.
df <- transform(df, V3 = ifelse(V2 == 'b', V1, V3), V1 = ifelse(V2 == 'b', V3, V1))
Edited: I got tripped up with column names, sorry. This works.
If you don't mind the rows ending up in different orders, this is kind of a 'cute' way to do this:
dat <- read.table(textConnection("V1 V2 V3
1 a 2
2 a 3
4 b 1"),sep = "",header = TRUE)
tmp <- dat[dat$V2 == 'b',3:1]
colnames(tmp) <- colnames(dat)
rbind(dat[dat$V2 != 'b',],tmp)
Basically, that just grabs the rows where V2 == 'b', reverses the columns and slaps them back together with everything else. This can be extended if you have more columns that don't need switching; you'd just use an integer index with those values transposed, rather than just 3:1.
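For the 250K-record case mentioned in the question, the swap can also be written with a logical index and a temporary vector, which keeps the row order and avoids any loop. This is a sketch on simulated data; big, orig, idx and tmp are made-up names and the random values only stand in for the real records.

```r
set.seed(1)
n <- 250000

# Simulated stand-in for the real data
big <- data.frame(V1 = runif(n),
                  V2 = sample(c("a", "b"), n, replace = TRUE),
                  V3 = runif(n),
                  stringsAsFactors = FALSE)
orig <- big  # untouched copy, kept only to check the result

idx <- big$V2 == "b"        # rows whose V1 and V3 should be swapped
tmp <- big$V1[idx]          # hold the old V1 values
big$V1[idx] <- big$V3[idx]
big$V3[idx] <- tmp
```

Each step is a single vectorised assignment, so the work is done in a handful of passes over the columns rather than once per row.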