This is a very basic question, but I am struggling to solve it. I have a master data frame that I have split into multiple data frames, based on unique values in a particular column. This was achieved by creating a list of data frames, and then saving each data frame as a separate csv file using the lapply function (see the code below).
Example code:
split_df <- split(df, df$ID)
u_ID <- unique(df$ID)
names(split_df) <- paste(u_ID)
lapply(names(split_df), function(x) write.csv(split_df[x], file= paste0(x, '_ID.csv')))
The issue is that the column headers in the output csv files are different to those in the master data frame. In the example below, where the data frame is split by unique ID values, the ID name has been added to each column header in the split data frames. I would like to end up with the same column headers in my output data frames as in my master data frame.
Example data:
ID Count Sp
1 A 23 1
2 A 34 2
3 B 4 2
4 A 4 1
5 C 22 1
6 B 67 1
7 B 51 2
8 A 11 1
9 C 38 1
10 B 59 2
dput:
structure(list(ID = c("A", "A", "B", "A", "C", "B", "B", "A",
"C", "B"), Count = c(23L, 34L, 4L, 4L, 22L, 67L, 51L, 11L, 38L,
59L), Sp = c(1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L)), .Names = c("ID",
"Count", "Sp"), class = "data.frame", row.names = c(NA, -10L))
Example output data frames (csv files) using above code:
$A
A.ID A.Count A.Sp
1 A 23 1
2 A 34 2
4 A 4 1
8 A 11 1
$B
B.ID B.Count B.Sp
3 B 4 2
6 B 67 1
7 B 51 2
10 B 59 2
$C
C.ID C.Count C.Sp
5 C 22 1
9 C 38 1
For this example, I would like to end up with output csv files containing the column headers ID, Count and Sp. Any solutions would be greatly appreciated!
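Simply use double square brackets:
lapply(names(split_df), function(x) write.csv(split_df[[x]], file= paste0(x, '_ID.csv')))
Note the double square brackets in split_df[[x]]. split_df[x] returns a one-element list, and write.csv converts that list to a data frame whose column names are prefixed with the element name (A.ID, A.Count, A.Sp). split_df[[x]] returns the data frame itself, so the original headers are kept.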
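As a minimal sketch of the difference between the two kinds of brackets, the following uses the example data from the question. Writing to tempdir() and passing row.names = FALSE are my assumptions, not part of the original answer; drop them if you want the files in the working directory with row numbers included.

```r
# Example data from the question
df <- data.frame(
  ID    = c("A", "A", "B", "A", "C", "B", "B", "A", "C", "B"),
  Count = c(23L, 34L, 4L, 4L, 22L, 67L, 51L, 11L, 38L, 59L),
  Sp    = c(1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# split() already names the list elements by the unique IDs
split_df <- split(df, df$ID)

# Single brackets keep the list wrapper, so the element name is
# prepended to every column when the list becomes a data frame
prefixed <- names(data.frame(split_df["A"]))

# Double brackets extract the data frame itself
plain <- names(split_df[["A"]])

# Write one file per ID (tempdir() is an assumption made for this sketch)
out_dir <- tempdir()
paths <- vapply(names(split_df), function(id) {
  path <- file.path(out_dir, paste0(id, "_ID.csv"))
  write.csv(split_df[[id]], file = path, row.names = FALSE)
  path
}, character(1))

# Headers read back from disk match the master data frame
names(read.csv(paths[["A"]]))
```

Returning the path from the function also avoids the list of NULLs that lapply would otherwise print.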
I have a complete data frame of all cities in Brazil, but I only want some predefined cities, which are listed in another column of the same data frame. I'd like to keep all the columns, but select only the rows where the city in the column with all cities also appears in the column with predefined cities.
data = read.csv(file="C:/Users/guilherme/Desktop/data.csv", header=TRUE, sep=";")
data
> AllCities Year1990 Year200 PredefinedCities CharacCities1 CharacCities2
1 A 2 4 C 12 5
2 B 2 2 A 11 10
3 C 3 4 F 09 2
4 D 4 2
5 E 5 6
6 F 6 2
I want the following
> data
AllCities Year1990 Year200 PredefinedCities CharacCities1 CharacCities2
1 C 3 4 C 12 5
2 A 2 4 A 11 10
3 F 6 2 F 09 2
You need merge -
merge(
data[, c("AllCities", "Year1990", "Year200")],
data[, c("PredefinedCities", "CharacCities1", "CharacCities2")],
by.x = "AllCities", by.y = "PredefinedCities"
)
AllCities Year1990 Year200 CharacCities1 CharacCities2
1 A 2 4 11 10
2 C 3 4 12 5
3 F 6 2 9 2
Note - Your data format is unusual. If you can, you should fix the data source so that it gives you the AllCities and PredefinedCities tables separately, or maybe even joins them correctly before creating the csv file.
Data -
structure(list(AllCities = c("A", "B", "C", "D", "E", "F"), Year1990 = c(2L,
2L, 3L, 4L, 5L, 6L), Year200 = c(4L, 2L, 4L, 2L, 6L, 2L), PredefinedCities = c("C",
"A", "F", "", "", ""), CharacCities1 = c(12L, 11L, 9L, NA, NA,
NA), CharacCities2 = c(5L, 10L, 2L, NA, NA, NA)), .Names = c("AllCities",
"Year1990", "Year200", "PredefinedCities", "CharacCities1", "CharacCities2"
), class = "data.frame", row.names = c(NA, -6L))
data <- data[data$AllCities %in% data$PredefinedCities,]
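Neither approach above returns the rows in the order shown in the desired output (C, A, F): merge sorts by the key, and %in% keeps the original row order without re-aligning the Charac columns. A minimal sketch using match(), assuming that empty strings in PredefinedCities mark unused rows as in the dput above (the names predefined, idx and result are only illustrative):

```r
data <- data.frame(
  AllCities        = c("A", "B", "C", "D", "E", "F"),
  Year1990         = c(2L, 2L, 3L, 4L, 5L, 6L),
  Year200          = c(4L, 2L, 4L, 2L, 6L, 2L),
  PredefinedCities = c("C", "A", "F", "", "", ""),
  CharacCities1    = c(12L, 11L, 9L, NA, NA, NA),
  CharacCities2    = c(5L, 10L, 2L, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the rows that actually name a predefined city
predefined <- data[data$PredefinedCities != "",
                   c("PredefinedCities", "CharacCities1", "CharacCities2")]

# Position of each predefined city within AllCities, in predefined order
idx <- match(predefined$PredefinedCities, data$AllCities)

# Pull the yearly columns for those cities and bind the characteristics
result <- cbind(data[idx, c("AllCities", "Year1990", "Year200")], predefined)
rownames(result) <- NULL
result
```

match() returns NA for any predefined city that is missing from AllCities, so it is worth checking anyNA(idx) on real data.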
I have 3 columns a, b and c, and I want to combine them into a new column based on the value of the column mode, as follows:
if mode = 1, data from a
if mode = 2, data from b
if mode = 3, data from c
example
mode a b c
1 2 3 4
1 5 53 14
3 2 31 24
2 12 13 44
1 20 30 40
Output
mode a b c combine
1 2 3 4 2
1 5 53 14 5
3 2 31 24 24
2 12 13 44 13
1 20 30 40 20
We can use row/column indexing to get the values from the dataset. Here, the row sequence (seq_len(nrow(df1))) and the column index ('mode') are combined with cbind into a two-column matrix, which is used to extract the corresponding values from the subset of the dataset that holds columns a, b and c (df1[2:4])
df1$combine <- df1[2:4][cbind(seq_len(nrow(df1)), df1$mode)]
df1$combine
#[1] 2 5 24 13 20
data
df1 <- structure(list(mode = c(1L, 1L, 3L, 2L, 1L), a = c(2L, 5L, 2L,
12L, 20L), b = c(3L, 53L, 31L, 13L, 30L), c = c(4L, 14L, 24L,
44L, 40L)), class = "data.frame", row.names = c(NA, -5L))
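To make the matrix-indexing step more concrete, here is a small sketch that builds the index matrix explicitly. The names vals, idx and combine are illustrative only; the data are the same as above.

```r
df1 <- data.frame(
  mode = c(1L, 1L, 3L, 2L, 1L),
  a    = c(2L, 5L, 2L, 12L, 20L),
  b    = c(3L, 53L, 31L, 13L, 30L),
  c    = c(4L, 14L, 24L, 44L, 40L)
)

# Candidate values as a plain matrix: column 1 = a, 2 = b, 3 = c
vals <- as.matrix(df1[c("a", "b", "c")])

# One (row, column) pair per observation; 'mode' picks the column
idx <- cbind(row = seq_len(nrow(df1)), col = df1$mode)

# Indexing a matrix with a two-column matrix returns one value per pair
combine <- vals[idx]
combine
```

Selecting the columns by name rather than by position (2:4) keeps the code working if the column order of df1 changes.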
Another solution in base R that works by converting "mode" to letters then extracting those values in the matching columns.
df1$combine <- diag(as.matrix(df1[, letters[df1$mode]]))
Also, two ways with dplyr. Nested if_else():
library(dplyr)
df1 %>%
mutate(combine =
if_else(mode == 1, a,
if_else(mode == 2, b, c)
)
)
And case_when():
df1 %>% mutate(combine =
case_when(mode == 1 ~ a, mode == 2 ~ b, mode == 3 ~ c)
)
I am very new to programming with R, but I am trying to replace the column name by the dataframe name with a for loop. I have 25 dataframes with cryptocurrency time series data.
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a dataframe which stands for a cryptocurrency expressed in USD. And every dataframe has 2 columns: Date and Close (Closing price).
For example: the dataframe "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to replace the name of the second column ("Close") by the name of the dataframe ("BTC.USD")
For this case I used the following code:
colnames(BTC.USD)[2] <-deparse(substitute(BTC.USD))
And this code works as I imagined:
> head(BTC.USD)
# A tibble: 6 x 2
Date BTC.USD
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
Now I am trying to create a loop to change the second column name for all 25 dataframes of cryptocurrency data:
df_list <- ls(pattern="USD")
for(i in df_list){
aux <- get(i)
(colnames(aux)[2] =df_list)
assign(i,aux)
}
But the code does not work as I thought. Can someone help me figure out what step I am missing?
Thanks in advance!
You can put the data frames in a named list (l2 below) and use Map to assign the names, i.e.
Map(function(x, y) {names(x)[2] <- y; x}, l2, names(l2))
#$`a`
# v1 a
#1 3 8
#2 5 6
#3 2 7
#4 1 5
#5 4 4
#$b
# v1 b
#1 9 47
#2 18 48
#3 17 6
#4 5 25
#5 13 12
DATA
dput(l2)
list(a = structure(list(v1 = c(3L, 5L, 2L, 1L, 4L), v2 = c(8L,
6L, 7L, 5L, 4L)), class = "data.frame", row.names = c(NA, -5L
)), b = structure(list(v1 = c(9L, 18L, 17L, 5L, 13L), v2 = c(47L,
48L, 6L, 25L, 12L)), class = "data.frame", row.names = c(NA,
-5L)))
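Applied to the setup in the question, this could look like the sketch below. The two data frames are small made-up stand-ins for the 25 real ones, and df_names stands in for ls(pattern = "USD"). In the original loop, the whole vector df_list was assigned as the column name instead of the current element i, which is the step that needs to change.

```r
# Made-up stand-ins for two of the asker's data frames
BTC.USD <- data.frame(Date = as.Date(c("2015-12-31", "2016-01-01")),
                      Close = c(430, 434))
ETH.USD <- data.frame(Date = as.Date(c("2015-12-31", "2016-01-01")),
                      Close = c(0.93, 0.95))

# In the question this would be ls(pattern = "USD")
df_names <- c("BTC.USD", "ETH.USD")

# Option 1: collect the objects into a named list and rename with Map
dfs <- mget(df_names)
renamed <- Map(function(x, y) { names(x)[2] <- y; x }, dfs, names(dfs))

# Option 2: the original loop, using the current name 'nm'
# rather than the whole vector of names
for (nm in df_names) {
  aux <- get(nm)
  colnames(aux)[2] <- nm
  assign(nm, aux)
}

colnames(BTC.USD)
```

Keeping the data frames in a list (option 1) is usually easier to work with afterwards, for example when merging all series by Date.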
The problem is I have a set of data in two columns. Ex:
A B
3 5
6 7
4 4
7 8
1 6
8 7
Here I want to find the values that are the same in both the A and B columns (4 & 4). I also want to find the duplicates that are present in the B column (7 & 7).
After finding them, is there a way to remove them and keep them in a different file?
Also, could you point me to some good material on data manipulation with R?
We create two indexes for two columns
i1 <- df1$A == df1$B
i2 <- with(df1, duplicated(B)|duplicated(B, fromLast = TRUE))
df1[i1,1]
#[1] 4
df1[i1|i2,2]
#[1] 7 4 7
As the number of elements to be removed is different for the two columns, we loop through the columns and remove those values based on the logical index
Map(`[`, df1, list(!i1, !(i1|i2)))
#$A
#[1] 3 6 7 1 8
#$B
#[1] 5 8 6
data
df1 <- structure(list(A = c(3L, 6L, 4L, 7L, 1L, 8L), B = c(5L, 7L, 4L,
8L, 6L, 7L)), .Names = c("A", "B"), class = "data.frame", row.names = c(NA,
-6L))
Your dataframe
db<-data.frame(A=c(3,6,4,7,1,8),
B=c(5,7,4,8,6,7))
Identify equal and duplicated data
not_equal<-!db[,1]==db[,2]
not_duplicated<-!duplicated(db[,2])
Filter out
db[not_equal & not_duplicated,]
A B
1 3 5
2 6 7
4 7 8
5 1 6
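The question also asks about keeping the removed values in a different file. A minimal row-wise sketch, assuming that a row should be set aside when A equals B or when its B value occurs more than once; the file name removed.csv and the use of tempdir() are hypothetical choices for illustration.

```r
df1 <- data.frame(A = c(3L, 6L, 4L, 7L, 1L, 8L),
                  B = c(5L, 7L, 4L, 8L, 6L, 7L))

# TRUE where A equals B, or where the B value appears more than once
flag <- df1$A == df1$B |
  duplicated(df1$B) | duplicated(df1$B, fromLast = TRUE)

removed <- df1[flag, ]   # rows to set aside
kept    <- df1[!flag, ]  # rows that remain

# Store the removed rows separately (path is an assumption)
removed_path <- file.path(tempdir(), "removed.csv")
write.csv(removed, removed_path, row.names = FALSE)
```

Unlike the filter above, which keeps the first occurrence of a duplicated B value, this drops every occurrence, so both rows with B = 7 end up in the removed file.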
I used the aggregate function in R to bring down my data entries from 90k to 1800.
a=test$ID
b=test$Date
c=test$Value
d=test$Value1
sumA=aggregate(c, by=list(Date=b,Id=a), FUN=sum)
sumB=aggregate(d, by=list(Date=b,Id=a), FUN=sum)
final=sumA[1:2]
final[3]=sumA[3]/sumB[3]
Now I have data for 20 different dates in a month, with close to 90 different ids each day, so there are around 1800 entries in the final table.
My question is that I want to aggregate further and find the maximum value of final[3] for each date, so that I am left with just 20 values.
In simple terms -
There are 20 days.
Each day has 90 values for 90 ids.
I want to find the maximum of these 90 values for each day.
So at the end I would be left with just 20 values for 20 days.
However, the aggregate function is not working here when I use 'max' instead of sum.
Date ID Value Value1
1 A 20 10
1 A 25 5
1 B 50 5
1 B 50 5
1 C 25 25
1 C 35 5
2 A 30 10
2 A 25 45
2 B 40 10
2 B 40 30
This is the data.
By using the aggregate function I got the final table as
Date ID x
1 A 45/15=3
1 B 100/10=10
1 C 60/30=2
2 A 55/55=1
2 B 80/40=2
Now I want the maximum value for dates 1 and 2, that's it
Date max- Value
1 10
2 2
This is a one-step process using data.table. A data.table is an enhanced version of a data.frame and works really well. It inherits from data.frame, so it can be used in most places where a data.frame is expected.
Step0: Converting data.frame to data.table:
library(data.table)
setDT(test)
setkey(test,Date,ID)
Step1: Do the computation
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
Here is an explanation of the step:
The first part creates what you call the final table in your question:
test[,sum(Value)/sum(Value1),by=key(test)]
# Date ID V1
# 1: 1 A 3
# 2: 1 B 10
# 3: 1 C 2
# 4: 2 A 1
# 5: 2 B 2
Now this is passed to the second item to do the max function by Date:
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
# Date V1
# 1: 1 10
# 2: 2 2
Hope this helps.
It's a very well documented package. You should read more about it.
Maybe this helps.
test <- structure(list(Date = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), ID = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B"),
Value = c(20L, 25L, 50L, 50L, 25L, 35L, 30L, 25L, 40L, 40L
), Value1 = c(10L, 5L, 5L, 5L, 25L, 5L, 10L, 45L, 10L, 30L
)), .Names = c("Date", "ID", "Value", "Value1"), class = "data.frame", row.names = c(NA,
-10L))
res1 <- aggregate(. ~ID+Date, data=test, FUN=sum)
res1 <- transform(res1, x=Value/Value1)
res1
# ID Date Value Value1 x
#1 A 1 45 15 3
#2 B 1 100 10 10
#3 C 1 60 30 2
#4 A 2 55 55 1
#5 B 2 80 40 2
aggregate(. ~Date, data=res1[,-c(1,3:4)], FUN=max)
# Date x
# 1 1 10
# 2 2 2
First, I ran aggregate with two grouping variables (ID and Date) on the two value columns by using the formula . ~ ID + Date.
Then I created a new variable x, i.e. Value/Value1, with transform.
Finally, I ran aggregate again with one grouping variable (Date), after removing all the other variables except x.
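The same three steps can also be written with the columns named explicitly in the formulas, which avoids dropping columns by position. This is only a sketch of that variant; the names sums, ratio and daily_max are illustrative.

```r
test <- data.frame(
  Date   = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  ID     = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B"),
  Value  = c(20L, 25L, 50L, 50L, 25L, 35L, 30L, 25L, 40L, 40L),
  Value1 = c(10L, 5L, 5L, 5L, 25L, 5L, 10L, 45L, 10L, 30L),
  stringsAsFactors = FALSE
)

# Step 1: sum both value columns within each Date/ID group
sums <- aggregate(cbind(Value, Value1) ~ Date + ID, data = test, FUN = sum)

# Step 2: ratio of the two sums for each group
sums$ratio <- sums$Value / sums$Value1

# Step 3: maximum ratio per Date
daily_max <- aggregate(ratio ~ Date, data = sums, FUN = max)
daily_max
```

With the full data this would leave one row per date, i.e. 20 rows for 20 days.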