Similar to Replace a value in a datatable by giving the column index,
I'd like to replace a column in a data.table with another column of the same data.table, using column indexes only. (Yes, notwithstanding the fact that this is generally not good practice; in my case, it is the only way.)
DT <- data.table(A=1:5, B=6:10, C=10:14)
and I want
DT[, A:=C]
but without using the names A and C, only their index numbers 1 and 3.
Edit: I need to elaborate a bit more on my use case. I have multiple columns that need to be replaced by multiple other columns, and the replacements are indicated by two columns in the data.table.
DT <- data.table(A=1:5
, B=6:10
, C=10:14
, D=15:19
, E=20:24
, F=25:29
, G=c(1,2,NA,NA,NA)
, H=c(3,4,NA,NA,NA))
> DT
A B C D E F G H
1: 1 6 10 15 20 25 1 3 # --> column 1 (A) should be replaced by column 3 (C)
2: 2 7 11 16 21 26 2 4 # --> column 2 (B) should be replaced by column 4 (D)
3: 3 8 12 17 22 27 NA NA
4: 4 9 13 18 23 28 NA NA
5: 5 10 14 19 24 29 NA NA
Column G indicates the columns that need to be replaced; column H indicates the columns that should replace them. I'm dealing with a data.table of a few thousand columns, and I know the names of columns G and H, so they don't need to be dynamic.
desired outputs:
> desired_output1:
A B C D E F G H
1: 10 15 10 15 20 25 1 3 #all of column A was replaced by column C
2: 11 16 11 16 21 26 2 4 #all of column B was replaced by column D
3: 12 17 12 17 22 27 NA NA
4: 13 18 13 18 23 28 NA NA
5: 14 19 14 19 24 29 NA NA
> desired_output2:
A B C D E F G H
1: 10 6 10 15 20 25 1 3 # col A for this row was replaced by col C
2: 2 16 11 16 21 26 2 4 # col B for this row was replaced by col D
3: 3 8 12 17 22 27 1 2
4: 4 9 13 18 23 28 NA NA
5: 5 10 14 19 24 29 NA NA
Well, I don't think there is really any elegant way to accomplish this other than looping the assignment statement. Basically, you need DT[["G"]][i] for the index of the ith column to be replaced, and DT[["H"]][i] for the index of its replacement column, using list notation. In data.table you can refer to the column being replaced by a number, but to get the replacement values you need DT[[DT[["H"]][i]]], which for i = 1 would be DT[[3]]. Putting everything together inside an lapply loop gives you the following:
lapply(seq_along(na.omit(DT[["G"]])),function(i) DT[,DT[["G"]][i]:=DT[[DT[["H"]][i]]]])
Since columns G and H will either both contain values or both be NA, you can pick either one for the index in lapply; I chose G. However, make sure that the NA values are at the end of the columns, otherwise the indices produced by seq_along over the na.omit'ed vector will no longer line up with the rows of G and H when the loop executes. I assume from your description that this will be the case.
Since you don't really care about the list produced by lapply, only using it as a more efficient for loop, you can suppress the console output (which may get annoying if you have thousands of columns to change) by wrapping the above in invisible if you wish:
invisible(lapply(seq_along(na.omit(DT[["G"]])),function(i) DT[,DT[["G"]][i]:=DT[[DT[["H"]][i]]]]))
Hope this helps some!
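An alternative sketch of the same loop uses data.table::set, which accepts plain integer column positions on the left-hand side and, by looping only over the non-NA rows, tolerates NA entries anywhere in G and H (assuming the DT from the question, where G holds the target column indexes and H the source column indexes):

```r
library(data.table)

DT <- data.table(A = 1:5, B = 6:10, C = 10:14, D = 15:19, E = 20:24,
                 F = 25:29, G = c(1, 2, NA, NA, NA), H = c(3, 4, NA, NA, NA))

# loop only over rows where G is not NA, so NA placement doesn't matter
for (i in which(!is.na(DT[["G"]]))) {
  set(DT, j = as.integer(DT[["G"]][i]), value = DT[[ DT[["H"]][i] ]])
}
```

set() skips the overhead of the [.data.table call, which can add up over a few thousand column replacements.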
To clearly state my question, consider the following table:
A B C D
12 3 -1 2
421 4 12 13
11 4 -1 55
4 36 44 18
I have a data set whose entries must be positive, and all of its missing values are labelled as -1. The missing values in my data set are only located in column C.
Question: How can I remove all the rows in this example table which contain the value -1 in R?
What I expect after this operation is to be left with rows 2 and 4 only.
Either
subset(df1, C > 0)
as suggested, or
df1[df1$C > 0, ]
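A quick sketch with the sample data (assuming the data frame is called df1; both forms give the same rows):

```r
df1 <- read.table(text = "A B C D
12 3 -1 2
421 4 12 13
11 4 -1 55
4 36 44 18", header = TRUE)

# keep only the rows where C is positive
subset(df1, C > 0)
df1[df1$C > 0, ]
```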
You could also use filter from dplyr like this:
df <- read.table(text = "A B C D
12 3 -1 2
421 4 12 13
11 4 -1 55
4 36 44 18", header = TRUE)
library(dplyr)
filter(df, C > 0)
#> A B C D
#> 1 421 4 12 13
#> 2 4 36 44 18
Created on 2022-09-10 with reprex v2.0.2
I have data that contains characters in its first row, like this:
J K L M N O P
A T F T F F F T
B 14 15 10 2 3 4 78
C 10 47 15 9 6 12 12
D 17 44 17 1 0 15 11
E 3 12 14 3 2 15 17
I want to extract only the columns that contain the value "T" in row A.
So the result I want is this:
J L P
A T T T
B 14 10 78
C 10 15 12
D 17 17 11
E 3 14 17
Also, as a second step, I want to know how to do the same thing using two conditions, for example: extract all columns that contain the value "T" in row A and the value 17 in row D, so the result will be:
J L
A T T
B 14 10
C 10 15
D 17 17
E 3 14
Thank you
Here is your answer.
> df <- df[, df["A",] == "T" & df["D",] == 17]
You can use indexing to filter columns; it supports logical statements, and you can combine them with &.
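For the single-condition case from the first part of the question, just drop the second test (a sketch, assuming the table was read with row names A through E):

```r
df <- read.table(text = "J K L M N O P
A T F T F F F T
B 14 15 10 2 3 4 78
C 10 47 15 9 6 12 12
D 17 44 17 1 0 15 11
E 3 12 14 3 2 15 17", header = TRUE)

# keep only the columns whose entry in row A is "T"
df[, df["A", ] == "T"]
```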
I'm a beginner in R and I really need your help. I'm trying to get a new data frame that stores the sum of every five consecutive rows of my columns.
For example, I have a dataframe (delta) with two columns (A,B)
A B
2 3
1 2
3 2
4 5
3 7
5 6
2 5
and the output I'm looking for is
AA BB
13 19
16 22
17 25
where
13 = row1+row2+row3+row4+row5
16 = row2+row3+row4+row5+row6
and so on ...
I have no idea where to start. Thanks a lot for your help guys.
The subject refers to 4 rows but the example in the question refers to 5. We have used 5 below but if you intended 4 just replace 5 with 4 in the code.
1) rollsum Using the reproducible input in the Note at the end, use rollsum. Omit the as.data.frame if a matrix is OK as output.
library(zoo)
as.data.frame(rollsum(DF, 5))
## A B
## 1 13 19
## 2 16 22
## 3 17 25
2) filter filter in base R works too. Note that if you have dplyr loaded it masks filter, so in that case use stats::filter in place of filter to ensure you get the correct version.
setNames(as.data.frame(na.omit(filter(DF, rep(1, 5)))), names(DF))
## A B
## 1 13 19
## 2 16 22
## 3 17 25
Note
Lines <- "
A B
2 3
1 2
3 2
4 5
3 7
5 6
2 5"
DF <- read.table(text = Lines, header = TRUE)
Here is a data.table option using frollsum, e.g.,
> na.omit(setDT(df)[,lapply(.SD,frollsum,5)])
A B
1: 13 19
2: 16 22
3: 17 25
or
> na.omit(setDT(df)[,setNames(frollsum(.SD,5),names(.SD))])
A B
1: 13 19
2: 16 22
3: 17 25
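If you'd rather avoid packages entirely, a rolling sum can also be sketched in base R with cumsum, since each window sum is a difference of two cumulative sums (assuming the df from the question):

```r
df <- data.frame(A = c(2, 1, 3, 4, 3, 5, 2), B = c(3, 2, 2, 5, 7, 6, 5))

# rolling sum of window k as a difference of cumulative sums
roll_sum <- function(x, k = 5) {
  cs <- cumsum(x)
  cs[k:length(x)] - c(0, cs[seq_len(length(x) - k)])
}

as.data.frame(lapply(df, roll_sum))
```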
I have a data frame as follows
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30
I’d like to re-cast it with a single row for each Identifier and one column for each value in the current location column. I don’t care about the data in V1 but I need the data in V2 and these will become the values in the new columns.
Note that for the Location column there are repeated values for Identifiers 2 and 3.
I ASSUME that the first task is to make the values in the Location column unique.
I used the following (the data frame is called “Test”)
L<-length(Test$Identifier)
for (i in 1:L)
{
temp<-Test$Location[Test$Identifier==i]
temp1<-make.unique(as.character(temp), sep="-")
levels(Test$Location)=c(levels(Test$Location),temp1)
Test$Location[Test$Identifier==i]=temp1
}
This produces
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B-1 23
3 43 A 10
3 43 B 17
3 43 A-1 18
3 43 B-1 20
3 43 C 25
3 43 A-2 30
Then using
cast(Test, Identifier ~ Location)
gives
Identifier A B C B-1 A-1 A-2
1 21 24 NA NA NA NA
2 NA 15 18 23 NA NA
3 10 17 25 20 18 30
And this is more or less what I want.
My questions are
Is this the right way to handle the problem?
I know R people don't use the "for" construction, so is there a more R-elegant (relegant?) way to do this? I should mention that the real data set has over 160,000 rows and starts with over 50 unique values in the Location vector, and the function takes just over an hour to run. Anything quicker would be good. I should also mention that the cast function had to be run on 20-30k rows of the output at a time despite increasing the memory limit; all the cast outputs were then merged.
Is there a way to sort the columns in the output so that (here) they are A, A-1, A-2, B, B-1, C?
Please be gentle with your replies!
Usually your original format is much better than your desired result. However, you can do this easily using the split-apply-combine approach, e.g., with package plyr:
DF <- read.table(text="Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30", header=TRUE, stringsAsFactors=FALSE)
#note that I make sure that there are only characters and not factors
#use as.character if you have factors
library(plyr)
DF <- ddply(DF, .(Identifier), transform, Loc2 = make.unique(Location, sep="-"))
library(reshape2)
DFwide <- dcast(DF, Identifier ~Loc2, value.var="V2")
# Identifier A B B-1 C A-1 A-2
#1 1 21 24 NA NA NA NA
#2 2 NA 15 23 18 NA NA
#3 3 10 17 20 25 18 30
If column order is important to you (usually it isn't):
DFwide[, c(1, order(names(DFwide)[-1])+1)]
# Identifier A A-1 A-2 B B-1 C
#1 1 21 NA NA 24 NA NA
#2 2 NA NA NA 15 23 18
#3 3 10 18 30 17 20 25
For reference, here's the equivalent of #Roland's answer in base R.
Use ave to create the unique "Location" columns....
DF$Location <- with(DF, ave(Location, Identifier,
FUN = function(x) make.unique(x, sep = "-")))
... and reshape to change the structure of your data.
## If you want both V1 and V2 in your "wide" dataset
## "dcast" can't directly do this--you'll need `recast` if you
## wanted both columns, which first `melt`s and then `dcast`s....
reshape(DF, direction = "wide", idvar = "Identifier", timevar = "Location")
## If you only want V2, as you indicate in your question
reshape(DF, direction = "wide", idvar = "Identifier",
timevar = "Location", drop = "V1")
# Identifier V2.A V2.B V2.C V2.B-1 V2.A-1 V2.A-2
# 1 1 21 24 NA NA NA NA
# 3 2 NA 15 18 23 NA NA
# 6 3 10 17 25 20 18 30
Reordering the columns can be done the same way that #Roland suggested.
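For completeness, here is a sketch of the same idea with dplyr plus tidyr's pivot_wider, which has superseded dcast-style reshaping in that ecosystem (assuming the DF as read in above; names_sort = TRUE also handles the column-ordering question directly):

```r
library(dplyr)
library(tidyr)

DF <- read.table(text = "Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30", header = TRUE, stringsAsFactors = FALSE)

wide <- DF %>%
  group_by(Identifier) %>%
  mutate(Loc2 = make.unique(Location, sep = "-")) %>%  # disambiguate repeats per Identifier
  ungroup() %>%
  select(Identifier, Loc2, V2) %>%
  pivot_wider(names_from = Loc2, values_from = V2, names_sort = TRUE)
wide
```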
I'm having difficulty properly shrinking down the row numbers in a data frame.
I have a data set named "mydata" which I imported from a text file using R. The data frame has about 200 rows with 10 columns.
I removed the row number 3, 7, 9, 199 by using:
mydata <- mydata[-c(3, 7, 9, 199),]
When I run this command, rows 3, 7, 9 and 199 are gone from the list, but the row numbering doesn't automatically shrink down to 196; it stays at 200. I feel like these row numbers are somehow attached to each row as part of the data frame.
How do I fix this problem?
What puzzles me even more is that when I import the text file using RStudio, I don't have any problem (I see 196 when I run the above command). But when using R, I can't make the row numbers in the data frame match the actual number of rows in the list.
Can anyone please tell me how to fix this??
You can simply do:
rownames(mydata) <- NULL
after performing the subsetting.
For example:
> mydata = data.frame(a=1:10, b=11:20)
> mydata = mydata[-c(6, 8), ]
> mydata
a b
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
7 7 17
9 9 19
10 10 20
> rownames(mydata) <- NULL
> mydata
a b
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 7 17
7 9 19
8 10 20
You could also use the data.table package, which does not store row.names in the same way (see the data.table intro vignette); instead, it prints with the row number.
See the section on keys for how data.table works with row names and keys.
data.table inherits from data.frame, so a data.table is a data.frame for functions and packages that accept only data.frames.
e.g.
library(data.table)
mydata <- data.table(mydata)
mydata
## a b
## 1: 1 11
## 2: 2 12
## 3: 3 13
## 4: 4 14
## 5: 5 15
## 6: 6 16
## 7: 7 17
## 8: 8 18
## 9: 9 19
## 10: 10 20
mydata = mydata[-c(6, 8), ]
mydata
## a b
## 1: 1 11
## 2: 2 12
## 3: 3 13
## 4: 4 14
## 5: 5 15
## 6: 7 17
## 7: 9 19
## 8: 10 20