Restructuring DataFrame Based on Single Column Values - r

I am trying to restructure my data so that the values in one column are split into separate columns based on another column's values. I researched and found a simple solution for a single group, as seen in my current code below, but I would like a way to do it for all groups at once. I've tried to find a way to apply a loop to this, without success. Any help would be great. I am using the latest versions of R and RStudio. Thanks!
CURRENT DATAFRAME:
Row #People
A 3
A 2
A 2
B 1
B 1
C 3
C 3
C 2
C 1
Desired DataFrame:
A  B  C
3  1  3
2  1  3
2     2
      1
Current Code:
files <- read.csv("SampleData3.csv", header = TRUE)
# "A" must be quoted; unquoted, R looks for an object named A
subset <- as.data.frame(files[files$RowID == "A", "DisRank"])

Try the following:
library(qpcR)
# the "#People" header is read in as X.People; data.frame.na pads the
# unequal-length groups with NA
do.call(qpcR:::data.frame.na, split(df$X.People, df$Row))
A B C
1 3 1 3
2 2 1 3
3 2 NA 2
4 NA NA 1
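A base-R sketch of the same idea, in case you'd rather avoid the qpcR dependency: split People by Row, pad each group with NA to a common length, then bind the groups as columns. The data frame below rebuilds the question's data.

```r
# Rebuild the question's data
df <- data.frame(Row = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                 People = c(3, 2, 2, 1, 1, 3, 3, 2, 1))

# Split the values by group, then pad every group with NA to the same length
groups <- split(df$People, df$Row)
n <- max(lengths(groups))
padded <- lapply(groups, function(x) c(x, rep(NA, n - length(x))))

# Equal-length vectors can now be bound into a data frame
result <- as.data.frame(padded)
result
#    A  B C
# 1  3  1 3
# 2  2  1 3
# 3  2 NA 2
# 4 NA NA 1
```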

Here's a tidyverse way of doing it using tidyr::spread. You'll also need to add row numbers within each group, which I get rid of at the end using dplyr's select(-id).
Start by creating the data:
df = read.table(text="Row People
A 3
A 2
A 2
B 1
B 1
C 3
C 3
C 2
C 1", header = TRUE)
Now do the work:
library(tidyverse)
df %>%
  group_by(Row) %>%
  mutate(id = row_number()) %>%
  spread(key = Row, value = People) %>%
  select(-id)
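Worth noting: spread() has since been superseded; pivot_wider() is its replacement in tidyr 1.0+. A sketch of the same pipeline with it, rebuilding the data so the chunk stands alone:

```r
library(tidyr)
library(dplyr)

df <- read.table(text = "Row People
A 3
A 2
A 2
B 1
B 1
C 3
C 3
C 2
C 1", header = TRUE)

result <- df %>%
  group_by(Row) %>%
  mutate(id = row_number()) %>%   # row number within each group
  ungroup() %>%
  pivot_wider(names_from = Row, values_from = People) %>%
  select(-id)
result
```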

As far as I know, your desired DataFrame is not a valid data frame in R, because the columns would have different lengths, so as shown it is impossible (unless the shorter columns are padded with NA). You should explain why you want something like that. Other data types, such as lists, can store data in a structure like that, but I haven't a clue what you want to do afterwards.

How about reshape2::dcast(df, . ~ Row, value.var = "People", fun.aggregate = list)[, -1]? This will give you a data.frame with a list in each cell.
Output:
A B C
1 3, 2, 2 1, 1 3, 3, 2, 1
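For reference, a runnable version of that one-liner on the question's data; the cells of the result are list elements you index with [[:

```r
library(reshape2)

df <- data.frame(Row = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                 People = c(3, 2, 2, 1, 1, 3, 3, 2, 1))

# fun.aggregate = list stores each group's values as a list in one cell
res <- dcast(df, . ~ Row, value.var = "People", fun.aggregate = list)[, -1]
res$A[[1]]   # the vector stored in the A cell
```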

Related

How to write a function to calculate the mean of some columns in a dataframe in r?

I need to add a new column with the results calculated from the mean value of other columns.
For example:
A B C D E
1 2 3 4 ?
the question mark should equal to mean(2, 3, 4)
I wrote my code like this
df_new <- df %>% mutate(new_column = rowMeans(dplyr::select(., B:D)))
But because I have a really big data frame, I have to repeat this process many times. Is it possible to write a function to make it easier? I really don't know where to start.
If your data.frame looks like this:
df <- data.frame(A=1,B=2,C=3,D=4)
df
A B C D
1 1 2 3 4
you can get the mean you asked for like this:
data.frame(df,E=mean(as.numeric(df[,2:4])))
A B C D E
1 1 2 3 4 3
or means for a data.frame with more rows like this:
data.frame(df,E=rowMeans(df[,2:4]))
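Since the asker wants to repeat this many times, one option is to wrap rowMeans in a small helper that takes the columns to average and the name of the new column. This is just a sketch; the function and argument names are made up for illustration:

```r
# Hypothetical helper: add a column holding the row-wise mean of the
# chosen columns. `cols` can be column names or positions.
add_row_mean <- function(df, cols, new_col = "E") {
  df[[new_col]] <- rowMeans(df[, cols, drop = FALSE])
  df
}

df <- data.frame(A = 1, B = 2, C = 3, D = 4)
res <- add_row_mean(df, c("B", "C", "D"))
res
#   A B C D E
# 1 1 2 3 4 3
```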

Repeat (duplicate) just one row twice in R

I'm trying to duplicate just the second row in a dataframe, so that row will appear twice. A dplyr or tidyverse aproach would be great. I've tried using slice() but I can only get it to either duplicate the row I want and remove all the other data, or duplicate all the data, not just the second row.
So I want something like df1:
df <- data.frame(t = c(1, 2, 3, 4, 5),
                 r = c(2, 3, 4, 5, 6))
df1 <- data.frame(t = c(1, 2, 2, 3, 4, 5),
                  r = c(2, 3, 3, 4, 5, 6))
Thanks!
Here's also a tidyverse approach with uncount:
library(tidyverse)
df %>%
  mutate(nreps = if_else(row_number() == 2, 2, 1)) %>%
  uncount(nreps)
Basically, the idea is to set the number of times you want each row to occur: here row number 2 (hence row_number() == 2) occurs twice and all others occur once, but you could construct a more complex rule where each row has a different number of repetitions. Then uncount this variable (called nreps in the code).
Output:
t r
1 1 2
2 2 3
2.1 2 3
3 3 4
4 4 5
5 5 6
One way with slice would be :
library(dplyr)
df %>% slice(sort(c(row_number(), 2)))
# t r
#1 1 2
#2 2 3
#3 2 3
#4 3 4
#5 4 5
#6 5 6
Also :
df %>% slice(sort(c(seq_len(n()), 2)))
In base R, this can be written as :
df[sort(c(seq(nrow(df)), 2)), ]
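Another base-R route is to build a vector of repetition counts and index with rep(); this generalizes to repeating any row any number of times:

```r
df <- data.frame(t = c(1, 2, 3, 4, 5),
                 r = c(2, 3, 4, 5, 6))

reps <- rep(1, nrow(df))   # every row once...
reps[2] <- 2               # ...except row 2, which appears twice
df2 <- df[rep(seq_len(nrow(df)), times = reps), ]
df2
```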

Removing rows from a dataset based on conditional statement across factors

I am struggling to figure out how to remove rows from a dataset based on conditions across multiple factors in a large dataset. Here is some example data to illustrate the problem I am having with a smaller data frame:
Code<-c("A","B","C","D","C","D","A","A")
Value<-c(1, 2, 3, 4, 1, 2, 3, 4)
data<-data.frame(cbind(Code, Value))
data$Value <- (as.numeric(data$Value))
data
Code Value
1 A 1
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I want to remove rows where the Code is A and the Value is < 2. I understand the logic of selecting rows where Code is A and Value < 2, but I can't figure out how to remove them from the dataset without also removing the rows where Code is A and Value > 2, while keeping the other codes' rows with values less than 2.
#Easy to select for values of A less than 2
data2<- subset(data, (Code == "A" & Value < 2))
data2
Code Value
1 A 1
#But I want to remove values of A less than 2 without also removing values of A that are greater than 2:
data1<- subset(data, (Code != "A" & Value > 2))
data1
Code Value
3 C 3
4 D 4
### just using Value > 2 does not allow me to include values that are less than 2 for the other Codes (B,C,D):
data2<- subset(data, Value > 2)
data2
3 C 3
4 D 4
7 A 3
8 A 4
My ideal dataset would look like this:
data
Code Value
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I have tried different iterations of filter(), subset(), and select() but I can't figure out the correct conditional statement that allows me to remove the desired combination of levels of multiple factors. Any suggestions would be greatly appreciated.
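One way to express "remove rows where Code is A and Value < 2" is to negate the whole combined condition, so only rows matching both parts are dropped. A sketch on the question's data:

```r
Code <- c("A", "B", "C", "D", "C", "D", "A", "A")
Value <- c(1, 2, 3, 4, 1, 2, 3, 4)
# building the data frame directly keeps Value numeric
# (cbind() first coerces everything to character, hence the
#  as.numeric() step in the question)
data <- data.frame(Code, Value)

# keep every row EXCEPT those where both conditions hold
kept <- subset(data, !(Code == "A" & Value < 2))
kept
```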

How to take the latest entry from a data.frame and store it in new dataframe

I have a data.frame that is full of data, where the data for parameters repeats itself, but I want to use only the latest information that is stored.
Thankfully I have an index in the files that tells me which duplicate the current row in the data.frame is.
Example for my problem is the following:
A B C D
1 1 2 3 1
2 1 2 2 2
3 3 4 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
A small explanation: columns A and B can be considered the key, column C holds the value for that key, and column D is the index of the measurement. The index does not have to start from 1; it can start from 3, 6, or any other integer, because the data is incomplete.
So at the end the output should be like:
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
Can you please help me write an R program, or point me in the right direction, that saves every key with its latest index?
I have tried using for loops but it didn't work.
Sincerely, thanks. If you have any questions, feel free to ask.
Using duplicated and subsetting in base R, you can do
dat[!duplicated(dat[,1:2], fromLast=TRUE),]
A B C D
2 1 2 2 2
4 3 4 1 3
5 2 3 2 1
6 2 1 1 1
duplicated returns a logical vector indicating whether a row (here the first two columns) has been duplicated. The fromLast argument initiates this process from the bottom of the data.frame.
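The snippet above assumes dat already exists; rebuilding it from the question's table makes it runnable end to end:

```r
dat <- data.frame(A = c(1, 1, 3, 3, 2, 2),
                  B = c(2, 2, 4, 4, 3, 1),
                  C = c(3, 2, 2, 1, 2, 1),
                  D = c(1, 2, 2, 3, 1, 1))

# fromLast = TRUE marks earlier occurrences of each (A, B) key as
# duplicates, so only the last row per key survives
res <- dat[!duplicated(dat[, 1:2], fromLast = TRUE), ]
res
#   A B C D
# 2 1 2 2 2
# 4 3 4 1 3
# 5 2 3 2 1
# 6 2 1 1 1
```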
You can use dplyr verbs: group your data with group_by, then sort with arrange. The do verb allows you to operate at the group level, and tail grabs the last row of each group.
library(dplyr)
df1 <- df %>%
  group_by(A, B) %>%
  arrange(D) %>%
  do(tail(., 1)) %>%
  ungroup()
Thanks to Frank's suggestion, you could also use slice
df1 <- df %>%
  group_by(A, B) %>%
  arrange(D) %>%
  slice(n()) %>%
  ungroup()
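In recent dplyr (1.0+), slice_max() states the intent directly: take the row with the largest D per key. A sketch, again rebuilding the question's data:

```r
library(dplyr)

df <- data.frame(A = c(1, 1, 3, 3, 2, 2),
                 B = c(2, 2, 4, 4, 3, 1),
                 C = c(3, 2, 2, 1, 2, 1),
                 D = c(1, 2, 2, 3, 1, 1))

res <- df %>%
  group_by(A, B) %>%
  slice_max(D, n = 1, with_ties = FALSE) %>%  # keep the latest index per key
  ungroup()
res
```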

keep values of a data frame column R

In my data frame df I want to get the Id numbers satisfying the condition that the value of A is greater than the value of B. In the example I would only want Id = 2.
Id Name Value
1 A 3
1 B 5
1 C 4
2 A 7
2 B 6
2 C 8
vecA<-vector();
vecB<-vector();
vecId<-vector();
i<-1
while (i <= dim(df)[1]) {            # was length(dim(df)[1] — wrong and unbalanced
  if (df$Name[[i]] == "A") { vecA <- c(vecA, df$Value[[i]]) }  # take the i-th value, not the whole column
  if (df$Name[[i]] == "B") { vecB <- c(vecB, df$Value[[i]]) }
  i <- i + 1
}
vecId <- unique(df$Id)[vecA > vecB]  # compare after the loop; c(vecId,) was incomplete
First, you could convert your data from long to wide so you have one row for each ID:
library(reshape2)
(wide <- dcast(df, Id~Name, value.var="Value"))
# Id A B C
# 1 1 3 5 4
# 2 2 7 6 8
Now you can use normal indexing to get the ids with larger A than B:
wide$Id[wide$A > wide$B]
# [1] 2
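If you'd rather stay in base R, stats::reshape can do the same long-to-wide step; the prefixed column names (Value.A, Value.B) come from reshape's naming scheme:

```r
df <- data.frame(Id = c(1, 1, 1, 2, 2, 2),
                 Name = c("A", "B", "C", "A", "B", "C"),
                 Value = c(3, 5, 4, 7, 6, 8))

# long -> wide: one row per Id, one Value.<Name> column per Name
wide <- reshape(df, idvar = "Id", timevar = "Name", direction = "wide")
ids <- wide$Id[wide$Value.A > wide$Value.B]
ids
# [1] 2
```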
The first answer works out well for sure. I wanted to show regular subset operations as well, since you might want to check out some of the more recent R packages. It would be interesting if you had three groups to compare. In the code below, exp is the exact data.frame you started with.
library(plyr)
library(dplyr)
comp <- exp %>%
  filter(Name %in% c("A", "B")) %>%
  group_by(Id) %>%
  filter(min_rank(Value) > 1)
# If the whole row is needed
comp[which.max(comp$Value),]
# If not
comp[which.max(comp$Value),"Id"]
