Common R Tips - HackMD


###### tags: `R` `Data Processing` `Data Preprocessing`

# Common R Tips

Other Reference:
1. [Wrangling (Rpubs Note)](http://rpubs.com/RitaTang/WranglingNote)
2. [Visualization](/@ritatang242/SJF_dAR37)
3. [Visualization (Rpubs Note)](https://rpubs.com/RitaTang/Visualization)
4. [Text Mining (Preprocessing)](/@ritatang242/B1qbL3yLL)
5. [Text Mining (Chinese)](/PFrfv9NJSuy7frOV-wuJoA?both)
6. [PCA (Principal Component Analysis)](/@ritatang242/r1lDWqiUL)

# Package, initialization
```{r}
pacman::p_load()       # list the packages to load inside p_load()
rm(list = ls()); gc()  # clear the workspace, then run garbage collection
```

# Encode
```{r}
a = cbind(df$area, df$product_name)
Encoding(a) = 'UTF-8' # only works on character vectors
```

# Deal with NA
| Symbol | Definition |
|--------|--------|
| NA | missing or undefined data |
| NULL | empty object (e.g. null/empty lists) |
| Inf and -Inf | positive and negative infinity |
| NaN | results that cannot be reasonably defined |

+ Remove NA
```{r}
df = na.omit(df)
# equivalent: df %>% na.omit()
mean(data, na.rm = TRUE)
```
```{r}
a = airquality[, colSums(is.na(airquality)) == 0]
# Flag the NA cells, count the NAs in each column, and keep only the columns with none
```
+ Pass NA through
```{r}
# na.action is an argument of modelling functions (lm(), aggregate(), ...), not of mean();
# na.pass keeps the rows that contain NA instead of dropping them
aggregate(Ozone ~ Month, data = airquality, FUN = mean, na.action = na.pass)
```
+ Replace NA with 0
```{r}
mx[is.na(mx)] = 0
```
+ Read a placeholder value (here ".") as NA
```{r}
Reduce(rbind, Map(function(x) read.csv(x, na.strings = '.',
  stringsAsFactors = F)[, -1], c('cars1.csv', 'cars2.csv')))
# Map applies the reader to cars1 and cars2 in one pass (the result is a list)
# Values equal to "." are read as NA
# Reduce with rbind stacks the list into a single data frame
```

# Date Format
```{r}
as.Date("2018-10-10", format = "%Y-%m-%d")
as.character(r$BOARD_DATE) %>% as.Date(., "%Y%m%d")
as.Date(rail_df$BOARD_DATE %>% as.character(), format = '%Y%m%d')
```

# Data Frame Type
+ data.frame
    + Prints all the data when previewed
    + Converts string columns to factor (the default before R 4.0.0)
+ data.table
    + Does not convert strings to factor
    + Has no row names
    + Column selection uses list()
```{r}
dt[, list(A, C)]
# equivalent: df[, c(1, 3)]
```
+ tibble
    + Easy to inspect (printing shows only the first 10 rows)
    + Fast

# Read & Write
+ Packages
    + library(foreign): handles non-csv files, such as SAS
    + library(XML)
    + library(DBI): reads relational databases
    + library(RMySQL): reads MySQL databases
+ Read Data
    + fread (fastest / large files)
        + csv
        + library(data.table)
    + read_csv (medium speed)
        + csv (xlsx and xls need `readxl::read_excel` instead)
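        + library(readr)
    + read.csv (slowest)
        + csv
    + read_file
        + txt
        + library(readr)
    + read.table
        + txt
```{r}
library(data.table)
library(XML)
# fread()'s encoding only accepts "unknown", "UTF-8" or "Latin-1", so Big5 cannot be set there
fread("TaiwanRailway.csv", sep = '\t')
# for a Big5-encoded file, the base readers take fileEncoding
read.table("TaiwanRailway.csv", header = T, stringsAsFactors = F, fileEncoding = "BIG5")
xmlToDataFrame("Desktop/A_lvr_land_A.XML")
```
+ Write R.data
```{r}
save(obj_to_save, file = "../final.rdata") # save the named object(s); obj_to_save is a placeholder
save.image() # save everything in the current environment
load("../final.rdata")
```
+ Write Data
    + fwrite
    + write.csv
    + saveXML
    + write.foreign
```{r}
# df stands for the data frame being written; the output file names are placeholders
fwrite(df, "TaiwanRailway.csv")
write.csv(df, "TaiwanRailway.csv")
saveXML(xml, "test.xml")
write.foreign(df, datafile = "df.txt", codefile = "df.sas", package = "SAS")
```

# Function
```{r}
predtr = predict(mod4, tr)
# Mean absolute error over n observations
MAE = function(n, y, y_hat) {
  return((1 / n) * sum(abs(y - y_hat)))
}
MAE(nrow(tr), tr$Salary, predtr) # 223.9275
# (1 / nrow(tr)) * sum(abs(tr$Salary - predtr)) # 223.9275
```

# Sorting
+ sort
    + Returns the sorted values themselves (A-F)
```{r}
x = c("D", "A", "C", "F", "B", "E")
sort(x, decreasing = F)
```
> [1] "A" "B" "C" "D" "E" "F"
+ rank
    + Returns each element's rank in the sorted order (A-F), listed in the original order
```{r}
rank(x)
```
> [1] 4 1 3 6 2 5
> D: rank 4, A: rank 1
+ order
    + Returns, for each sorted element (A-F), its index in the original vector
```{r}
order(x, decreasing = F)
```
> [1] 2 5 3 1 6 4
> A: index [2], B: index [5]
+ arrange (see Dplyr)
    + Works only on a data.frame

# Dplyr

# Join
+ merge
+ left_join
+ right_join
+ inner_join
+ full_join
+ semi_join

# SQL
```{r}
sqldf::sqldf("select * from A")
# where              : same idea as filter
# group by           : group_by
# is not null        : non-NA values
# order by x limit 5 : sort by x and take the first five rows
```

# The apply family
## 1. apply
```{r}
apply(data, 1, FUN)
# margin: 1 means "row"; 2 means "column"
```
## 2. lapply
+ Replaces a loop
+ Returns a list
```{r}
lapply(data, FUN)
lapply(c(sum, mean, prod), FUN = function(f) f(data))
```
## 3. sapply
+ Replaces a loop
+ Returns a vector or matrix when the results can be simplified (otherwise a list)
```{r}
sapply(data, FUN)
```
## 4. tapply
+ Applies a function within each category
```{r}
tapply(data, INDEX = iris$Species, FUN = distinct_counts)
# INDEX: the grouping variable
# tapply() builds in what table() does: it splits by group first, then applies FUN to each group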

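```

# aggregate
```{r}
aggregate(price ~ cut + color, data = diamonds, mean)
# Group by each combination of cut and color, then take the mean of price
aggregate(x = mtcars$mpg, by = list("cyl" = mtcars$cyl), FUN = mean)
#   cyl        x
# 1   4 26.66364
# 2   6 19.74286
# 3   8 15.10000
```

# by
```{r}
by(data = mtcars$mpg, INDICES = list("cyl" = mtcars$cyl), FUN = mean)
# cyl: 4
# [1] 26.66364
# cyl: 6
# [1] 19.74286
# cyl: 8
# [1] 15.1
```

# MapReduce
## Map
```{r}
Map(FUN, df$x)
```
Much like lapply: each observation in the full df goes through the same function, and the results are collected into one list.
```{r}
genKPercentile = function(q1, q2, q3, q4) {
  # Build one quantile function per requested percentile
  pct = Map(function(x) { function(y) quantile(y, x / 100) }, c(q1, q2, q3, q4))
  names(pct) ...
```
`mapReduce` is a simple, direct way to compute grouped statistics. It can take several functions at once, and it returns the result as a matrix.
```{r}
dt
#    sex age weight
# 1    m  27   45.6
# 2    f  25   55.9
# 3    m  40   49.0
# 4    f  28   59.5
# 5    m  38   53.5
# 6    f  32   48.9
# 7    m  36   45.9
# 8    f  26   53.4
# 9    m  32   48.4
# 10   f  31   54.2
```
Group by sex and compute the summary statistics: mean age, mean weight, standard deviation of age, standard deviation of weight.
```{r}
mapReduce(sex, mean(age), mean(weight), sd(age), sd(weight), data = dt)
#   [,1]  [,2]     [,3]     [,4]
# f 28.4 54.38 3.049590 3.858368
# m 34.6 48.48 5.176872 3.179151
```

# Identify item
## identical
Checks whether two objects are exactly equal.
```{r}
identical(AY$cust_id, subset(AX, train)$cust_id)
# [1] TRUE
```
## unique
Finds the distinct values (or distinct combinations).
```{r}
unique(cust[, c(1:3)]) # how many distinct patterns occur across columns 1 to 3 of cust
unique(cust$cust_id)   # all distinct cust_id values
```
## duplicated
Checks the elements one by one: the first occurrence of a value counts as new, and any later occurrence of the same value is flagged as a duplicate.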
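A minimal sketch of that rule in base R. The vector `ids` is made up for illustration and is not part of the original data.

```{r}
# ids is a made-up vector for illustration
ids = c("a", "b", "a", "c", "b", "a")

# TRUE marks every occurrence after the first
dup = duplicated(ids)
dup
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

# Dropping the flagged elements gives the same values as unique()
ids[!dup]
# [1] "a" "b" "c"
```

Because `duplicated` returns a logical vector of the same length as its input, it can be used directly as a row filter, which `unique` cannot.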

```{r}
a ...
... Age >= 18 ~ "Adult",
    Age < 18 ~ "Child",
    is.na(Age) ~ "Unknown")
```

# switch
```{r}
# switch(selector: the position or the name of the branch to run,
#        first branch: do A, second branch: do B, third branch: do C, ...)
switch("first", first = 1 + 1, second = 1 + 2, third = 1 + 3)
# [1] 2
switch(3, "first", "second", "third", "fourth")
# [1] "third"
```
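Two further behaviours of `switch` may be worth keeping next to the calls above: with a character selector, an unnamed final argument acts as the default, and an empty branch falls through to the next non-empty one. A small sketch, where the function `classify` and its labels are made up for illustration:

```{r}
# classify and its labels are hypothetical, for illustration only
classify = function(day) {
  switch(day,
         sat = ,              # empty branch: falls through to the next value
         sun = "weekend",
         mon = "start of week",
         "weekday")           # unnamed last argument is the default
}
classify("sat")
# [1] "weekend"
classify("tue")
# [1] "weekday"

# With a numeric selector there is no default: out of range returns NULL (invisibly)
is.null(switch(5, "first", "second"))
# [1] TRUE
```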

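The `tapply` call in the apply section passes `FUN = distinct_counts`, which the note never defines. The sketch below assumes it means "number of distinct values" and uses the built-in `iris` data; both the helper's definition and the choice of `Sepal.Length` are assumptions made for illustration.

```{r}
# distinct_counts is assumed here to mean "number of distinct values"
distinct_counts = function(v) length(unique(v))

# One result per Species level, returned as a named array
res = tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = distinct_counts)
res

# With FUN = length, tapply() reproduces the group counts from table()
n = tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = length)
all(n == table(iris$Species))
# [1] TRUE
```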

